ANDROMEDA: An FPGA Based RISC-V MPSoC Exploration Framework
Farhad Merchant, Dominik Sisejkovic, Lennart M. Reimann, Kirthihan Yasotharan, Thomas Grass, Rainer Leupers
Preprint - Accepted in VLSI Design 2021
Institute for Communication Technologies and Embedded Systems, RWTH Aachen University, Germany
{farhad.merchant, sisejkovic, reimannl, yasotharan, grass, leupers}@ice.rwth-aachen.de

Abstract—With the growing demands of consumer electronic products, computational requirements are increasing exponentially. To meet the computational needs of applications, computer architects are trying to pack as many cores as possible on a single die for accelerated execution of application program code. In a multiprocessor system-on-chip (MPSoC), striking a balance among the number of cores, the memory subsystem, and the network-on-chip parameters is essential to attain the desired performance. In this paper, we present
ANDROMEDA, a RISC-V based framework that allows us to explore different configurations of an MPSoC and observe the resulting performance penalties and gains. We emulate the various MPSoC configurations on the Synopsys HAPS-80D Dual FPGA platform. Using STREAM, matrix multiply, and N-body simulations as benchmarks, we demonstrate our framework's efficacy in quickly identifying the right parameters for efficient execution of these benchmarks.
Index Terms—design space exploration, multiprocessor system-on-chip, RISC-V, performance tuning
I. INTRODUCTION
Multiprocessor systems-on-chip (MPSoCs) are the common computing substrate for application domains such as automotive and the internet-of-things (IoT) [1] [2]. Recently, MPSoCs have been heavily adopted in service-oriented architectures due to their ability to offer an order-of-magnitude speed-up over the state-of-the-art [3]. The applications executed on MPSoCs in these domains are compute- and communication-intensive, requiring complex platforms with deep memory hierarchies and network infrastructure to reduce application run-time under constrained energy and area budgets [4]. The right balance between system parameters is required to attain the desired performance.

Exploring system-level parameters for run-time performance improvement is an extensively studied topic in the literature [5]. However, most of the proposed solutions are incomplete and sometimes rely on a complex set of parameters that could be simplified. Moreover, for many solutions, rapid field-programmable gate array (FPGA) prototyping is intricate due to the complexities of the intermediate tools, which have non-standardized interfaces.

Recently, the open RISC-V instruction-set architecture has revolutionized system design due to its flexibility [6]. With the momentum gained by the adoption of RISC-V, there is an increasing demand for RISC-V based system design and prototyping. While there have been several attempts to design and develop efficient single-core RISC-V compute platforms, only a handful of RISC-V MPSoCs exist [7] [8]. There is an increasing need for frameworks that support RISC-V based MPSoC exploration, enabling system designers to develop efficient platforms for next-generation computing systems. In this paper, we present
ANDROMEDA, a unified framework that facilitates the design space exploration of FPGA based MPSoCs. The ANDROMEDA framework helps in identifying the bottlenecks in application execution for system-level optimizations. The major contributions of this paper are as follows:
• A light-weight network-on-chip (NoC) for clustered RISC-V MPSoC platform development.
• ANDROMEDA, an FPGA-based MPSoC framework for early-stage exploration and identification of application execution bottlenecks.
• Evaluation of ANDROMEDA using STREAM, matrix multiply, and N-body simulations, and identification of the bottlenecks for these benchmarks.

The rest of the paper is organised as follows: In Section II, we discuss the background and related literature. The light-weight NoC implementation is discussed in Section III, along with the proposed ANDROMEDA framework. The experimental setup and results are presented in Section IV. We conclude our work in Section V.

II. BACKGROUND AND RELATED WORK
A. Background

1) RocketChip Generator: The RocketChip generator is an open-source tool to instantiate the Rocket core and a synthesizable system-on-chip (SoC). The RocketChip generator is implemented in the Chisel hardware construction language. Its major advantage is its configurability: parameters such as the number of cores, cache sizes, cache placement policies, arithmetic units, pipeline stages, the memory management unit, hardware performance counters, and the interconnect can be customised to attain the desired SoC.
2) Rocket Core: Rocket is a 5-stage, single-issue, in-order CPU that implements the RISC-V ISA. It supports both RV32G and RV64G. Rocket features a non-blocking data cache, a branch predictor, an MMU, and an FPU. Rocket can be configured to meet individual requirements; options include the supported ISA extensions (e.g., M, A, F, D) and cache sizes.
3) Synopsys HAPS-80D Dual: Synopsys high-performance ASIC prototyping systems (HAPS) is a family of FPGA based systems for prototyping ASICs and SoCs. HAPS solutions are shipped with prototyping hardware and supporting software tools to enable faster system validation and earlier software development.

Fig. 1: Synopsys HAPS-80D Dual

Fig. 1 shows the overall system architecture of the HAPS-80D Dual platform. The platform consists of two UltraScale XCVU440 FPGAs, connected via several high-speed serial links. The HAPS-80D Dual is equipped with 13 low-skew clock networks. The global clocks are labeled GCLK0 to GCLK12. GCLK0 is a fixed 100 MHz clock reserved for system functions. GCLK1-GCLK12 are available for user designs and, in most cases, are sourced from the on-board phase-locked loops (PLLs). The HAPS-80D Dual contains an on-board DDR4 DRAM module with 8 GB capacity. It can be used as memory in user designs or as a large sample memory for signal debugging. The HAPS-80D Dual features various I/O options, such as GPIO, PMOD, JTAG, and UART. The system connects to a host computer via a high-speed USB-C cable. The Synopsys UMRBus protocol is used both for configuring the system with the user design and for subsequent debugging.
B. Related Work
There has been a plethora of MPSoC work in the literature focusing on early-stage design space exploration [5]. Some of the early works focused on design automation for custom generation of MPSoCs, while a few focused on industrial-grade design customization. The Daedalus framework presented in [9] and [10] introduces a system-level exploration, programming, and prototyping framework. It takes a sequential application as input and translates it into an MPSoC implementation on an FPGA. However, the Daedalus framework considers only dataflow-dominated applications.

The system-on-chip environment (SCE) is an interactive framework presented in [11]. The SCE framework accepts high-level specifications as input and translates them into a hardware/software implementation. The major bottleneck in the SCE design flow is formulating the initial system-level specifications for the desired hardware/software implementation. The SystemCoDesigner framework presented in [12] has similar limitations.
Fig. 2: The RocketChip mesh NoC: numbers at the top-left corners of nodes indicate node IDs; numbers inside nodes indicate core (hart) IDs.
Fig. 3: Internal structure of a node in NOC_BASE

Our proposed framework, ANDROMEDA, is a deliberately simple framework for RISC-V based MPSoC prototyping on FPGA. The input to the framework is a set of high-level system parameters that result in an MPSoC implementation on the Synopsys HAPS-80D Dual platform.

III. NETWORK-ON-CHIP AND ANDROMEDA
A base system, NOC_BASE, was designed and used as a starting point for design space exploration. Table I shows the parameters of NOC_BASE together with the other configurations, which are explored in a later section. In the following, the hardware architecture of NOC_BASE is covered.

The system consists of 16 nodes, N_0 ... N_15, interconnected in a 4x4 mesh. In each node N_k, a RocketChip SoC with (n+1) cores, C_k0 ... C_kn (hart 0 ... hart n), is located. Every RocketChip instance is created from the same generated ExampleRocketSystem. Cores C_k0 ... C_k,(n-1) are (general-purpose) processing cores, while C_kn =: S_k is a small Rocket core, which is also

TABLE I: NoC configurations
Name       Nodes  Proc. cores/node  Router       Flow-control
NOC_BASE   16     4                 S_k          Store-and-forward
NOC_SW     16     4                 AXIS-Switch  Store-and-forward
NOC_SW_C   16     4                 AXIS-Switch  Cut-through

Fig. 4: ANDROMEDA framework for FPGA based MPSoC exploration

coherently interconnected with the processing cores. The small core's role is to act as a co-processor that handles all network-related tasks, such as routing and switching. Though using a general-purpose core as a software router is likely not the optimal solution, this route was initially chosen to get to a working prototype system as quickly as possible. Furthermore, without the RocketChip generator, the design of this particular approach would have been more involved, too.

The memory model of the system follows the distributed memory paradigm. Each node contains BRAM as main memory (used for data and instructions), which is only accessible by the node's local cores, and not from other nodes. The size of the BRAM is configurable in Vivado and was set to 256 kB for all nodes except N_0, which has 2 MB. In total, the complete system can fit up to 5.75 MB of on-chip data. N_0 additionally incorporates an AXI UART Lite to interface with a serial console on the host computer. Fig. 3 shows the internal structure of a node. We use the NoC and RocketChip to build a distributed memory system, and ANDROMEDA for system-level exploration (see Fig. 4).

ANDROMEDA is a simple framework consisting of input system parameters that are used for system configuration. The parameters are the number of cores, cache size (see the configurations in Table II) and coherency, the floating-point unit (FPU) and its pipeline stages, the memory management unit, hardware performance counters, and the interconnect network and its parameters.

In ANDROMEDA, a designer sets the parameters, from which the desired multicore or manycore system is automatically generated. The generated system is then prototyped on the HAPS-80D Dual FPGA prototyping platform, where it is evaluated using benchmarks written in the C programming language.
If the attained performance is not satisfactory, the parameters are re-calibrated through designer intervention until satisfactory performance is reached. The proposed ANDROMEDA framework enables rapid prototyping, as the system parameters are at a higher level of abstraction than in the literature. At this stage, automatic tuning of the parameters is out of the scope of this work.

IV. EXPERIMENTAL SETUP AND RESULTS
A. Benchmarks
To evaluate the performance of ANDROMEDA, three benchmarks written in C have been applied. In the following,

TABLE II: Data cache configurations
Name nSets nWays Size (KiB)
BASE 64 4 16
C-64-8    64  8   32
C-64-16   64  16  64

the selected benchmarks are briefly explained.
1) STREAM:
The STREAM benchmark [13] is used to measure the sustainable memory bandwidth of computer systems. Though initially targeted at high-performance computing systems, it can also be applied to personal or embedded computers. STREAM executes four kernels on contiguous arrays a, b, and c, and determines the resulting bandwidth from the execution time. The kernels are: COPY (a[i] = b[i]), SCALE (a[i] = q*b[i]), ADD (a[i] = b[i] + c[i]), and TRIAD (a[i] = b[i] + q*c[i]).

To minimize cache effects, the sizes of the arrays should, in general, be larger than the largest available cache. The data type of the arrays can be selected in the benchmark; the default type, double, was also used in this work.
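The four kernels can be sketched in C as follows. This is a minimal illustration, not the official STREAM source; the array size matches the 128,000-element runs used later in this paper, and the function names are ours:

```c
#include <stddef.h>

#define STREAM_N 128000   /* array size used for the BASE32 runs in this paper */

static double a[STREAM_N], b[STREAM_N], c[STREAM_N];

/* The four STREAM kernels; q is the scalar used by SCALE and TRIAD. */
void stream_copy(void)      { for (size_t i = 0; i < STREAM_N; i++) a[i] = b[i]; }
void stream_scale(double q) { for (size_t i = 0; i < STREAM_N; i++) a[i] = q * b[i]; }
void stream_add(void)       { for (size_t i = 0; i < STREAM_N; i++) a[i] = b[i] + c[i]; }
void stream_triad(double q) { for (size_t i = 0; i < STREAM_N; i++) a[i] = b[i] + q * c[i]; }
```

Each kernel streams through the arrays with unit stride, which is why STREAM measures sustainable bandwidth rather than latency.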
2) Matrix Multiplication:
As matrix multiplication (Matmul) is an operation found in many applications, a Matmul benchmark was also implemented for RocketChip. It multiplies two matrices, A (N x K) and B (K x M), and writes the result into a third matrix, C (N x M). Matrices A and C are stored in row-major format, while matrix B is stored in column-major format. All matrices contain double-precision values.
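A minimal C sketch of this storage scheme (an illustration, not the paper's exact implementation) shows why the column-major layout of B helps: the inner k-loop then walks a row of A and a column of B with unit stride in both:

```c
#include <stddef.h>

/* C (N x M, row-major) = A (N x K, row-major) * B (K x M, column-major).
 * Because B is column-major, column j of B is the contiguous block
 * B[j*K .. j*K + K-1], so the dot product below is stride-1 in both inputs. */
void matmul(size_t N, size_t K, size_t M,
            const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < M; j++) {
            double acc = 0.0;
            for (size_t k = 0; k < K; k++)
                acc += A[i * K + k] * B[j * K + k];  /* row i of A, column j of B */
            C[i * M + j] = acc;
        }
    }
}
```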
3) N-body Simulation:
The N-body simulation (N-body) numerically solves the N-body problem, a classic problem from orbital mechanics. Given are N bodies with mass m_i, initial position r_i(t = 0), and initial velocity v_i(t = 0). The bodies exert forces on each other according to Newton's law of gravitation. The net force exerted on body i is given by:

  F_i(t) = sum_{j=0, j != i}^{N-1} [ G m_i m_j / ||r_i(t) - r_j(t)||^3 ] (r_j(t) - r_i(t)).   (1)

The goal is to find the positions and velocities of all bodies after a time t. The simulation is performed in several timesteps. In each timestep, the net force exerted on each body is calculated according to Equation (1). From the net force, the acceleration experienced by that particular body is computed. Then, based on the acceleration, the new position and velocity can be derived. The computational complexity of an N-body simulation is O(N^2) per timestep. All values (masses, positions, velocities) are stored in single-precision floating-point format (float). The storage requirement scales with O(N).

B. Parallelization Techniques
For parallelizing the benchmarks, the worksharing principle was applied. In all three benchmarks (STREAM, Matmul, N-body), the work can be distributed evenly among the cores. For bare-metal applications, some manual work is required to realize the worksharing. In general, each core calculates the start index of its designated partition based on its hart ID. After a parallel region, synchronization may be necessary. The riscv-tests provide a barrier() function, which can be used to synchronize the cores.

Matmul can be parallelized by dividing matrix A into sub-matrices A_i. Each core i then calculates the corresponding partition C_i of matrix C by multiplying A_i with B.

N-body is parallelized by partitioning the arrays containing the body data evenly among the cores. Each core then computes the new positions and velocities of the bodies in its assigned partition. At the end of each timestep, a barrier is required to ensure that each core has updated its body values before the next timestep starts.

C. Evaluation
This section evaluates the ANDROMEDA framework presented in Section III for different parameters, covering the resource utilization and benchmark performance of a single node (BASE). BASE32 denotes a single node with 32 cores.
1) Number of cores:

a) FPGA utilization:
Table III shows the FPGA resource utilization of BASE for core counts from one to 32. Synthesis for all versions was constrained to 10 MHz. As can be seen in Table III, the LUT counts suggest sub-linear scaling. However, this is primarily caused by modules inside the SoC that are independent of the number of cores. A linear increase can be observed for the number of BRAMs, which are used for the L1 data and instruction caches, and for the DSP slices, which are, for example, used to implement parts of the floating-point unit (FPU). The 32-core system utilizes around 40% of the available LUTs.

b) STREAM:
Fig. 5a shows the memory bandwidth measurements obtained from the STREAM benchmark. It was run on BASE32 with an array size of 128,000. The theoretical peak memory bandwidth delivered by the outer memory bus (i.e., the SmartConnect) is 8 bytes per cycle, as the AXI data bus is 64 bit wide. As Fig. 5a suggests, with increasing core count, a larger fraction of the available memory bus bandwidth can be utilized. Linear scaling can be seen for one to four cores. Beginning at eight cores, however, the sustainable bandwidth starts to saturate at about 1.41 bytes/cycle for the COPY kernel, well below the theoretical maximum.

c) Matmul:
Matmul with N=K=M=128 was run on BASE32. Fig. 5b shows the average execution times (in cycles) for varying core counts. Up to four cores, the speedup is approximately equal to the number of cores. At eight cores, the speedup drops to about 6 (instead of the ideal 8). The situation worsens for even larger core counts. These results again show the limitation of shared memory architectures.

d) N-body:
Fig. 5c plots the execution times (in cycles) of an N-body simulation with N=4,096 bodies run on BASE32. As can be seen from Fig. 5c, the cycle count decreases with increasing core count, as expected. Up to 16 cores, close-to-ideal speedup is achieved. For 32 cores, however, the speedup is only 20 (instead of the ideal 32). At high core counts, the shared memory bus can still limit the performance.

TABLE III: FPGA utilization of BASE (f = 10 MHz)
Number of cores  LUTs     BRAM  DSP
1                -        12    35
2                69,053   24    70
4                131,188  48    140
8                255,000  96    280
16               499,931  192   560
32               998,145  384   1120

Overall, it can be concluded that adding more cores, in general, improves performance. However, if the number of cores sharing a bus gets too large (> 8), the shared bus becomes the bottleneck.
2) Cache subsystem:

a) FPGA utilization:
The synthesis results for the cache configurations from Table II are shown in Table IV. The percentages indicate the relative increase in the respective resource count compared to BASE. As expected, the usage of both LUTs and BRAMs of the L1 data cache grows as the number of ways is doubled. The last column, "Total LUTs", refers to the LUT usage of the complete ExampleRocketSystem of a particular configuration. It can be seen that the increase in LUTs is reasonable for both cache configurations.

To compare the different L1 data cache sizes, Matmul and N-body were run on the three configurations.

b) Matmul:
Fig. 6a shows the execution time (in cycles) for a Matmul with N=K=M=128 run on the three configurations from Table IV. It can be seen that, in general, the performance improves with larger cache sizes. For a cache size of 64 KiB (C-64-16), the execution time drops by over 8% for all core counts compared to BASE. This result is achieved with a LUT overhead of less than 4%, but with quadruple the amount of BRAMs per L1 data cache.

c) N-body:
Fig. 6b shows the execution time (in cycles) for an N-body simulation with N=2,048 bodies. As for Matmul, it can be observed that the execution time reduces when more cache is available. A larger cache means that more bodies can be kept in the local memories.

The results show that larger caches indeed improve performance, though the gain depends on the particular application's working-set size and access patterns. The total area overhead is mainly devoted to on-chip memory.
D. Network-on-Chip
The NoC presented in Section III has been designed to evaluate the benefits of distributing cores across multiple nodes instead of having all cores share a single bus. This section evaluates the explored NoC configurations from Table I.
1) FPGA synthesis results:
Table V summarizes the total resource usage of the three explored configurations. Both NOC_SW and NOC_SW_C reduce the overall utilization compared to NOC_BASE. NOC_SW and NOC_SW_C both use hardware switches instead of a software router; NOC_SW_C uses a switch with cut-through switching. The primary savings in terms of LUTs and BRAMs are a consequence of

TABLE IV: FPGA utilization of cache configs (f = 50 MHz)
Config    L1 Data Cache LUTs   L1 Data Cache BRAM   Total LUTs
BASE      3065                 4                    131,026
C-64-8    3579 (+16.8%)        8 (+100%)            133,271 (+1.7%)
C-64-16   4248 (+38.6%)        16 (+300%)           136,078 (+3.9%)

Fig. 5: (a) STREAM benchmark run on BASE32, size = 128,000 doubles, (b) Matmul (N=K=M=128) run on BASE32, and (c) N-body simulation with N=4,096 bodies and 10 timesteps run on BASE32.
Fig. 6: (a) Matmul (N=K=M=128) run on BASE, C-64-8, and C-64-16 for core counts 1, 2, and 4, and (b) N-body simulation (N=2,048, 10 timesteps) run on BASE, C-64-8, and C-64-16 for core counts 1, 2, and 4.

removing three AXI4-Stream memory-mapped FIFOs from each node. LUTs can be saved because the AXI4-Stream Data FIFOs, added in NOC_SW and NOC_SW_C, require less logic, as they do not come with an MMIO interface.
2) Benchmarks:
The performance and scaling of the NoC are analyzed using the previously introduced Matmul and N-body benchmarks.

a) Matmul:
For parallelizing Matmul on the NoC, the principles from the shared-memory parallel version presented earlier were extended to multiple nodes. Now, each node N_k is assigned a partition A_k. The partitions are distributed by N_0 using the noc_scatter() function. Further, each node needs the complete matrix B to compute its respective partition C_k of matrix C; B is distributed by N_0 using the noc_bcast() function. The calculated partitions C_k are finally collected by N_0 with the noc_gather() function. Fig. 7a shows the execution times (in cycles) for a Matmul with N=256, K=128, and M=32, run on different node/core

TABLE V: FPGA utilization of NoC configs (f = 5 MHz)

Config     LUTs                BRAM
NOC_BASE   2,011,152           2,384
NOC_SW     1,905,056 (-5.3%)   2,368 (-0.7%)
NOC_SW_C   1,901,234 (-5.5%)   2,368 (-0.7%)

TABLE VI: FPGA utilization of NoC configs (f = 5 MHz)

Config     LUTs
NOC_BASE   52,992
NOC_SW     39,016 (-26.4%)
NOC_SW_C   35,880 (-32.3%)

arrangements. It can be observed that the computation time decreases with increasing aggregate core count. This scaling even holds for the maximum core count of 64. Recall that in the shared memory case, the cycle count began to increase already for core counts above eight. Fig. 7b compares the execution times for a Matmul (N=512, K=128, M=32) between two 16-core arrangements, (16,1) and (4,4), and a 16-core SMP system. It can be seen that even though the 16-core SMP system suffers from the scaling bottleneck seen in Fig. 5b, it still performs better than the two compared node/core configurations, which are dominated by communication overheads.

In conclusion, the naive parallel Matmul is not well suited for execution on a distributed memory system, where communication dominates computation. Though exploiting intra-node parallelism improves performance in general, the gains are quickly overshadowed by the communication overhead as more nodes are introduced. Lastly, it must again be stated that the potential of the NoC designed in this work is likely limited by the small cores S_k.

b) N-body: As the N-body simulation is a computationally intensive application, it appears to be a good candidate to benefit from a distributed system. The NoC parallel version follows similar principles as the shared memory version. Fig. 7c shows the execution times (in cycles) for an N-body simulation with N=4,096. The speedup values compared to the single-core version from Fig. 5c are indicated at the top of each bar. It can be immediately noticed that the computation-to-communication ratio is considerably larger than for Matmul. Further, the speedup is superlinear up to arrangement (4,2) compared to the speedups of the SMP version shown in Fig. 5c.

Fig. 8 compares N-body on NOC_SW_C with BASE32 for increasing body counts. The performance is measured in single-precision FLOPs/cycle. In Fig. 8a, the performance is compared for 16 cores.
It can be seen that the 16-core SMP system performs better for a smaller number of bodies, whereas for larger body counts the performance of the SMP system drops significantly, due to the fact that the working sets get larger than the available L1 data caches, resulting in increased bus pressure. The distributed computation, on the other hand, does not experience this bottleneck for the same number of bodies.

Fig. 7: (a) Matmul (N=256, K=128, M=32) run on NOC_SW_C, (b) comparison of Matmul (N=512, K=128, M=32) run on NOC_SW_C and 16-core BASE32, and (c) N-body simulation (N=4,096) run on NOC_SW_C. The values at the top of each bar indicate the speedup compared to the single-core version from Fig. 5c.

(a) 16 cores  (b) 32 cores
Fig. 8: Comparison of N-body between NOC_SW_C and two SMP systems (BASE32)

In summary, it was shown that the benefit of employing a distributed system largely depends on the computation-to-communication ratio. The naive Matmul does not benefit much from distributing the workload across multiple nodes; in fact, the performance worsens when more than eight nodes are used, due to the heavy communication requirements. The N-body simulation can profit from the NoC for large N, as the computation greatly dominates the communication.

Further, it was shown that the hybrid approach, i.e., having multiple cores inside each node, can yield an additional performance gain, as intra-node parallelism can be exploited. The other benefit of hybrid systems is that less inter-node communication is required for the same total core count (e.g., (4,4) vs. (16,1)).

V. CONCLUSION
We presented a light-weight NoC for distributed memory MPSoCs, together with ANDROMEDA, a RISC-V MPSoC exploration framework that incorporates this light-weight NoC. The generated RISC-V MPSoCs are prototyped on the Synopsys HAPS-80D Dual. The experimental evaluation demonstrated that the ANDROMEDA framework is easy to use for early-stage system prototyping, and that user analysis of the benchmarks and applications helps to identify performance bottlenecks in NoC-based MPSoCs. In the future, we plan to extend support to other FPGA platforms and focus on RISC-V customization for further run-time reduction.
REFERENCES
[1] C. E. Salloum et al., "The ACROSS MPSoC - a new generation of multi-core processors designed for safety-critical embedded systems," 2012, pp. 105-113.
[2] J. Zhou et al., "Security-critical energy-aware task scheduling for heterogeneous real-time MPSoCs in IoT," IEEE Transactions on Services Computing, vol. 13, no. 4, pp. 745-758, 2020.
[3] C. Wang et al., "Service-oriented architecture on FPGA-based MPSoC," IEEE TPDS, vol. 28, no. 10, pp. 2993-3006, 2017.
[4] A. Kumar et al., "An FPGA design flow for reconfigurable network-based multi-processor systems on chip," 2007, pp. 1-6.
[5] A. Gerstlauer et al., "Electronic system-level synthesis methodologies," IEEE TCAD, vol. 28, no. 10, pp. 1517-1530, 2009.
[6] A. Waterman et al., The RISC-V Instruction Set Manual, Volume II: Privileged Architecture, Document Version 20190608-Priv-MSU-Ratified, RISC-V Foundation, June 2019.
[7] A. Kamaleldin et al., "Modular memory system for RISC-V based MPSoCs on Xilinx FPGAs," IEEE, pp. 68-73. [Online]. Available: https://ieeexplore.ieee.org/document/8906519/
[8] F. Zaruba et al., "The cost of application-class processing: Energy and performance analysis of a Linux-ready 1.7 GHz 64-bit RISC-V core in 22 nm FDSOI technology," IEEE TVLSI, vol. 27, no. 11, pp. 2629-2640, 2019.
[9] M. Thompson et al., "A framework for rapid system-level exploration, synthesis, and programming of multimedia MP-SoCs," 2007, pp. 9-14.
[10] H. Nikolov et al., "Daedalus: Toward composable multimedia MP-SoC design," 2008, pp. 574-579.
[11] R. Dömer et al., "System-on-chip environment: A SpecC-based framework for heterogeneous MPSoC design," EURASIP J. Embed. Syst., vol. 2008, 2008. [Online]. Available: https://doi.org/10.1155/2008/647953
[12] J. Keinert et al., "SystemCoDesigner - an automatic ESL synthesis approach by design space exploration and behavioral synthesis for streaming applications,"