A distributed memory, local configuration technique for re-configurable logic designs
Alexander E. Beasley ∗ Unconventional Computing Laboratory, UWE, Bristol, UK
March 25, 2020
Abstract
The use and location of memory in integrated circuits is a key factor in their performance. Memory requires a large physical area, access times limit overall system performance, and connectivity can result in large fan-out. Modern FPGA systems and ASICs contain an area of memory used to set the operation of the device from a series of commands sent by a host. Implementing these settings registers requires a level of care; otherwise the resulting implementation can contain a number of large fan-out nets that consume valuable resources and complicate the placement of timing-critical pathways. This paper presents an architecture for implementing and programming these settings registers in a distributed manner across an FPGA, and shows how the presented architecture works in both clock-domain crossing and dynamic partial re-configuration applications. The design is compared to that of a ‘global’ settings register architecture. We implement the architectures using Intel FPGA's Quartus Prime software targeting an Intel FPGA Cyclone V. It is shown that the distributed memory architecture has a smaller resource cost (as small as 25% of the ALMs and 20% of the registers) compared to the global memory architectures.
The use of memory, memory accessing, and memory mapping techniques has a large impact on system performance [38, 8, 34]. Efficient mapping techniques, reduction in communication overhead and the use of distributed memories can vastly increase the system's overall performance, particularly for intensive tasks such as loops and scalable graph operations [9, 22, 26, 6, 27]. The implementation of memories inside an embedded system comes with many research possibilities. Memory technologies are becoming denser and faster, allowing higher density memory to be implemented close to its point of use. Despite this, memory still requires a large physical space.

Distributed memory, where the memory is close to the point at which it is used, offers huge benefits, so long as the memories are kept coherent where necessary [19, 21, 13].

Integrated circuits often require memory to store user-defined settings that control the mode of operation. Examples include the sample rate or resolution of an ADC, the applied phase shift of an RF phase shifter, and the gain of a variable gain amplifier.

Field Programmable Gate Arrays (FPGAs) provide a flexible platform for designers to fabricate seemingly endless weird and wonderful systems. Designers often wish to make parameterisable systems whose operation can be controlled by a number of settings. One way to achieve this is by use of parameters [2] (or generics in VHDL [3]) that can be set at compile time. Parameters (generics) are a very powerful tool, available in hardware description languages, for creating re-usable code. However, each combination of settings must be compiled separately, introducing a large amount of processing overhead and leading to a separate image file per configuration. Alternatively, designers can implement an area of the FPGA as an array of registers in which settings can be stored and propagated across the design.
These registers can be programmed by means of a connection with a host system (typically USB in a modern system). Implementing these features in a device creates demand on resources, reducing the overall resources available to be used as functional logic.

∗ Corresponding author: Alexander Beasley, [email protected]

FPGA place and route stages are complicated procedures, attempting to locate resources as close together as possible to reduce routing complication and net delay [14, 40, 15, 16]. As the resources are fixed, this often results in trade-offs between the quality of the fitter result and the run-time of the fitter [29]. In addition, large fan-out nets often take priority during the ‘fitting’ stage of an FPGA's compilation. Synthesis tools often attempt to insert extra resources or promote high fan-out nets to the clocking nets [1, 10], reducing the available resources for timing-critical pathways. This leads to more complicated designs that suffer from bottle-necking, manifesting itself as a reduction in the maximum operating frequency of a design.

FPGAs have a large amount of memory distributed throughout the device. This memory lends itself neatly to tasks that benefit from distributed memory close to the point of use, such as loop operations and array-intensive operations [31]. By extension, we can use the embedded memory blocks to create the sets of registers used to set up and control the FPGA. Distributing the settings across the FPGA to their point of use helps to reduce the required routing resources, limit the high fan-out nets and improve timing closure.

In this paper we explore how a typical ‘global’ register map expands with the number of required settings for a design and the width of each of these settings. The global register map is connected to modules in which a varying number of entries in the global register map are used, to help us model the resource requirements when fanning out the register map.
A second architecture is presented that removes the global register map and distributes the settings across the design to the points at which they are used. We discuss how these architectures deal with the common problem of clock domain crossing and the more recent problem of dynamic partial reconfiguration, the process by which a small portion of a design is changed at run-time without affecting the operation of the rest of the device.

The rest of this paper is organised as follows. Section 2 presents architectures for creating and distributing settings using a ‘global’ register map, and an architecture for distributing the settings across the device in a method that is robust to multiple clocking domains and partial dynamic re-configuration. Metrics for the architectures are presented in Sect. 3. Finally, conclusions are drawn in Sect. 4.

The settings registers are usually considered as an area of memory in which the stored values represent modes of operation for a design. These stored values are used throughout the design to influence operation. There are a number of ways to achieve the desired behaviour. The seemingly obvious one is simply to reference the values, stored in a global memory location, throughout the respective parts of the design, leading to a routing as in Fig. 1.

Alternatively, distributing the memory map throughout the design moves the settings closer to where they are used. This reduces the complexity of the routing, but does not necessarily reduce the overall resource requirement. Local copies of the register map close to the point at which they are used allow the designer to safely register the values into the appropriate clock domains. The additional flip-flop stages play an important part in breaking up the total routed path into smaller elements: the shorter the path, the easier it is for a design to meet timing closure.
However, the additional flip-flops increase the overall resource cost for a design. An example of such a design can be seen in Fig. 2.

Distributing the memory map across the design can be achieved without increasing the routing complexity. Designing the distributed memory map with a common bus interface for its configuration, Fig. 3, reduces the overall resource cost and significantly reduces the required routing resource.

The common bus interface has a number of benefits: reduced routing complexity, safe crossing into different clock domains, reduction in global memory resources, and connection into dynamically partially re-configurable logic space.
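The routing argument above can be illustrated with a rough signal count. The following Python sketch is illustrative only: the function names, the 32-bit data bus, the 8-bit address bus, and the assumption of one Select and one Ready line per module are hypothetical choices that mirror the labels in Figs. 1 and 3, not values taken from the implemented designs.

```python
def global_map_bits(registers_per_module, register_width):
    """Bits routed out of a global memory map (Fig. 1 style).

    Assumption: every setting used by a module is routed to it in
    parallel, so the count grows with both register count and width.
    """
    return sum(registers_per_module) * register_width


def config_bus_signals(n_modules, data_width=32, addr_width=8):
    """Signals on a shared configuration bus (Fig. 3 style).

    Assumption: one shared Data bus, one shared Address bus, and one
    Select line plus one Ready line per module.
    """
    return data_width + addr_width + 2 * n_modules


# One slave module holding 226 settings of 32 bits each.
print(global_map_bits([226], 32))   # 7232
print(config_bus_signals(1))        # 42

# Doubling the settings doubles the global routing; the bus is unchanged.
print(global_map_bits([452], 32))   # 14464
print(config_bus_signals(1))        # 42
```

Under these assumptions the bus width depends only on the number of modules, whereas the global map fan-out scales with every additional setting.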
It is not uncommon for a modern digital system to use multiple clocks [33], in which data are moved from one clock domain to another and memories are connected to different clock domains. Moving from one clock domain to another requires the use of safe clock domain crossing techniques, which are in themselves a large research field [24, 7, 28, 32]; however, they use up yet more valuable resources. Typically, configuration data would be set in a slow clock domain and moved into much faster domains, potentially as very wide, parallel busses.

Figure 1: A global copy of the memory map is populated via the host controller; respective settings are routed to the appropriate modules on multi-bit busses. (Diagram labels: host controller landing point; memory map; Modules 1 to N; host signals Data, Valid, Ready.)

Figure 2: Entries from the global memory map are copied to where they are required locally. (Diagram labels: host controller landing point; memory map; Modules 1 to N, each with local settings; host signals Data, Valid, Ready.)

Figure 3: Bus connects elements of the design to a decoding module that distributes memory map information across the device. Uniform bus allows connection of partially dynamically re-configurable modules into the memory map bus. (Diagram labels: host controller landing point; memory map module decoder; Modules 1 to N; host signals Data, Valid, Ready; bus signals Data, Address, Select [0] to [N], Ready.)

In addition to the increase in resources required for crossing clock domains, multi-clock systems lack determinism, which causes problems for the verification process. Rectifying the non-deterministic nature of such systems and providing verification techniques (both stand-alone and built-in) is a rich source of research [37, 23]. Additionally, frameworks for performing timing analyses and signal integrity in a CDC application [28, 32] have been proposed.

The architecture presented here, Fig. 3, exports a ‘Ready’ signal from each of the subsystems. The ‘Ready’ signal is used to indicate that the logic has been moved to a safe state in which the local memory map may be written to using the configuration bus. No changes are made to the local configuration memories while logic is operating, hence there is no danger of the registers being sampled while they are transitioning and the clock domains are safely crossed.
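The ‘Ready’ handshake described above can be sketched as a small behavioural model. This is not the SystemVerilog used for the reported designs; the class names, the `operating` flag and the `host_write` method are hypothetical, and clocks are not modelled at all. The sketch only shows the rule that a local memory map accepts a write when its module is selected and reports that it is in a safe state.

```python
class SlaveModule:
    """Hypothetical slave with a local settings memory and a Ready output."""

    def __init__(self, n_settings):
        self.local_settings = [0] * n_settings
        self.operating = False  # True while the functional logic is running

    @property
    def ready(self):
        # Ready is only asserted once the logic is parked in a safe state.
        return not self.operating

    def bus_write(self, select, address, data):
        """Accept a write only when selected and in the safe state."""
        if select and self.ready:
            self.local_settings[address] = data
            return True
        return False


class MemoryMapDecoder:
    """Drives the shared Data/Address lines and one Select per module."""

    def __init__(self, modules):
        self.modules = modules

    def host_write(self, module_index, address, data):
        accepted = False
        for index, module in enumerate(self.modules):
            select = index == module_index
            accepted = module.bus_write(select, address, data) or accepted
        return accepted


modules = [SlaveModule(4), SlaveModule(4)]
decoder = MemoryMapDecoder(modules)

modules[1].operating = True
print(decoder.host_write(1, 2, 0xAB))  # False: module 1 is still operating
modules[1].operating = False
print(decoder.host_write(1, 2, 0xAB))  # True: written while Ready is high
print(modules[1].local_settings)       # [0, 0, 171, 0]
```

In this model a refused write simply returns False and the host would retry; how the real host controller reacts to a de-asserted Ready is not specified here.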
Dynamic reconfiguration and Dynamic Partial Reconfiguration (DPR) are rapidly growing in popularity as they enable FPGA designs to be changed at run-time to better meet changing system demands [25, 11]. DPR is in use across a number of sectors, including fault recovery [5], memory controllers [36], real-time signal processing [12], software defined radio [35, 18, 17], cognitive radio [39], bandwidth reduction [30], video filters [20], and RADAR signal processing [41], to name a few.

DPR designs contain a mix of static logic and re-configurable logic. Between the elements of the design a common interconnect is implemented, Fig. 4. The interconnect fabric contains the signals required for the configuration bus. When one or more modules in a re-configurable portion of the FPGA are changed, the configuration bus is connected into the new module along with all other data-path signals. Any settings registers inside the partially reconfigured module are then set over the configuration bus.
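The sequence described above (swap the module, then program its settings over the unchanged bus) can be sketched as follows. The class and module names are hypothetical, and the model assumes that a newly loaded module starts with all local settings at zero; the text does not state the reset values.

```python
class Module:
    """Hypothetical module exposing the common configuration bus."""

    def __init__(self, name, n_settings):
        self.name = name
        self.local_settings = [0] * n_settings  # assumed reset values

    def bus_write(self, address, data):
        self.local_settings[address] = data


class ReconfigurableRegion:
    """A region whose module can be swapped while the bus stays fixed."""

    def __init__(self, module):
        self.module = module

    def reconfigure(self, new_module):
        # Models loading a partial image: the old settings are lost and
        # the new module starts from its reset values.
        self.module = new_module

    def bus_write(self, address, data):
        # The static side always drives the same Data/Address interface.
        self.module.bus_write(address, data)


region = ReconfigurableRegion(Module("filter_a", n_settings=4))
region.bus_write(0, 12)

region.reconfigure(Module("filter_b", n_settings=2))
print(region.module.local_settings)  # [0, 0]
for address, value in enumerate([7, 9]):
    region.bus_write(address, value)
print(region.module.local_settings)  # [7, 9]
```

The static logic never needs to know how many settings the new module holds, only how to address them over the common interface.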
Figure 4: Partially re-configurable design showing the common programming interface in the interconnect logic between static and re-configurable logic. (Diagram labels: static logic; reconfigurable logic; common interface logic carrying Data, Address, Select and Ready.)

Example designs of the above architectures were written using SystemVerilog (IEEE 1800) and processed using Intel FPGA Quartus Prime 19.1.0 (Build 670); the target device for compilation is a Cyclone V (5CSXFC6D6F31C8). Synthesis metrics are presented for each architecture: Adaptive Logic Modules (ALMs), registers, combinatorial Adaptive Look Up Tables (ALUTs) and maximum operating frequency. Implementations are given for a variety of memory depths and widths.
Figure 5: ALMs used in final fit (total ALMs less ALMs recovered from dense packing) for global memory module only, plotted against memory depth [words]. (a) Global memory has a registered output. (b) Global memory has no registered output.
Figures 5 to 8 show key metrics for an implementation of a global memory system. The global memory system contains the decoding logic for writing to the memory, the memory itself, and the output stage that would be connected to the rest of the design. These figures do not include the resource consumption of slave modules where the settings would be used, nor any clock domain crossing logic that may be implemented.

ALMs (Intel), similar to Configurable Logic Blocks (CLBs) (Xilinx), contain a number of resources, typically (A)LUTs, adders, multiplexers, routing logic, and registers [4]. Fig. 5 shows that adding a register stage to the output of the memory significantly increases the number of ALMs needed for implementation; for instance, 128 512-bit words with a final register stage require just over 10,000 (10,292.6) more ALMs, approximately an extra 40%. Similarly, the registered design requires an extra 65,536 dedicated logic registers (Fig. 6), an approximately 100% increase in resource. The number of ALUTs, Fig. 7, also increases by approximately 40%. This is to be expected, since the implementations shown in subfigures (a) of Figs. 5 to 8 have an extra register stage per bit of the memory map at the output.

Figure 6: Dedicated logic registers for global memory module only, plotted against memory depth [words]. (a) Global memory has a registered output. (b) Global memory has no registered output.

Figure 7: Combinatorial ALUTs for global memory module only, plotted against memory depth [words]. (a) Global memory has a registered output. (b) Global memory has no registered output.

This is an obvious drawback in terms of resource consumption. However, the accompanying benefit of the extra register stage is that the routing between the memory and the target can now be broken into shorter segments. This manifests itself as an increase in operating frequency for the design. Figure 8 shows the maximum operating frequency of the implementation that uses an extra register. When synthesising just the memory module itself we are unable to provide f_max figures when there is no additional output register, because there are no valid paths (paths between two flip-flops) on which the timing analyzer (TimeQuest) can operate.

Figure 8: Maximum operating frequency [Hz] (die temperature 85 °C) for global memory module only, plotted against memory depth [words]. Data only given for global memory with registered output.

In Sect. 3.1 the resource consumption for the memory decode logic and the memory itself is shown. However, this is only half the story for a design that uses a global set of memory where entries are propagated out to other areas of the design. In this section we take a global memory system that has a global memory of 256 32-bit words and propagates these out to a slave module with a varying number of configuration registers. In addition, designs that use a combination of output registers on the global memory map, clock domain crossing registers (synchronisation chain length is 2 registers) and final location registers are examined.

Figure 9 shows the post-fit ALM requirements, Fig. 10 the post-fit register requirements, and Fig. 11 the post-fit ALUT requirements for each configuration of the global memory map architecture. As expected, increasing the number of target registers linearly increases the requirement of each resource. Designs with a greater number of register stages (post global map register, clock domain crossing synchronisation chain registers and destination registers) have significantly higher resource requirements than designs with fewer register stages: 10099.1 ALMs, 38146 registers, and 1925 ALUTs for a design with 226 configuration registers and the maximum number of routing register stages, compared to 2710.5 ALMs, 8258 registers, and 1913 ALUTs for a design with the same number of configuration registers but no register stages to break down the length of the routing. The more crowded a design becomes, the greater the impact that removing the routing registers has on the maximum speed of a path.

Figure 9: ALM consumption of global memory architecture with a single slave module using a variety of configuration registers and routing registers. (Axes: ALMs in final fit against slave registers per slave; series are combinations of global reg, cc reg and final reg.)
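The figures quoted for the global architecture can be cross-checked with a few lines of arithmetic. The Python sketch below uses only numbers stated in the text; the variable names are illustrative.

```python
# Global memory module alone: one output register per bit of the map.
depth_words = 128
word_width_bits = 512
extra_output_registers = depth_words * word_width_bits
print(extra_output_registers)  # 65536

# Post-fit figures quoted for 226 configuration registers.
max_stages = {"ALMs": 10099.1, "registers": 38146, "ALUTs": 1925}
no_stages = {"ALMs": 2710.5, "registers": 8258, "ALUTs": 1913}

ratios = {name: max_stages[name] / no_stages[name] for name in max_stages}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
# ALMs: 3.73x
# registers: 4.62x
# ALUTs: 1.01x
```

The 65,536 extra registers match one register per stored bit, and the routing register stages account for most of the ALM and register growth while leaving the ALUT count almost unchanged.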
Figure 10: Register consumption of global memory architecture with a single slave module using a variety of configuration registers and routing registers. (Axes: registers against slave registers per slave; series are combinations of global reg, cc reg and final reg.)

Figure 11: ALUT consumption of global memory architecture with a single slave module using a variety of configuration registers and routing registers. (Axes: combinatorial ALUTs against slave registers per slave; series are combinations of global reg, cc reg and final reg.)
Figure 12: ALM consumption of distributed memory architecture with 1 to 4 slave module(s) and a variety of configuration registers. (Axes: ALMs in final fit against slave registers per slave.)

Figure 13: Register consumption of distributed memory architecture with 1 to 4 slave module(s) and a variety of configuration registers. (Axes: registers against slave registers per slave.)
The resources required for the distributed configuration memory architecture, shown in Figs. 12 to 14, are considerably lower than for the global memory architecture. The graphs shown here are for implementations with a number of slave modules (1 to 4); each implementation varies the number of configuration registers per slave. For comparison, numbers from the ‘1 slave’ implementations can be mapped to the results given in Sect. 3.2. The resources used for a distributed configuration memory implementation using 226 target registers per slave are 2556.0 ALMs, 7499 registers, and 1887 ALUTs. That is 25% of the ALMs and 20% of the registers used in the global design with the maximum number of routing registers, a significant cost saving. Increasing the number of slaves in the design has a linear effect on the resource cost.
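The quoted percentages follow directly from the post-fit numbers. The short Python check below uses the figures stated for 226 configuration registers; the dictionary names are illustrative.

```python
distributed = {"ALMs": 2556.0, "registers": 7499, "ALUTs": 1887}
global_max_stages = {"ALMs": 10099.1, "registers": 38146, "ALUTs": 1925}

share = {
    name: 100.0 * distributed[name] / global_max_stages[name]
    for name in distributed
}
for name, percent in share.items():
    print(f"{name}: {percent:.1f}% of the global design")
# ALMs: 25.3% of the global design
# registers: 19.7% of the global design
# ALUTs: 98.0% of the global design
```

The ALM and register shares round to the 25% and 20% quoted above; the ALUT count is nearly identical between the two designs, so the saving comes from ALMs and registers.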
Figure 14: ALUT consumption of distributed memory architecture with 1 to 4 slave module(s) and a variety of configuration registers. (Axes: combinatorial ALUTs against slave registers per slave.)
Figure 15: f_max for configuration memory architectures, plotted in Hz against slave registers per slave. (a) Global memory architecture; series are combinations of global reg, cc reg and final reg. (b) Distributed memory architecture.

Figure 15 shows that the maximum operating frequency of a design is also influenced by the topology of the configuration architecture. A global memory architecture achieves a maximum f_max of just under 140 MHz, compared to approximately 210 MHz for the distributed memory architecture.

In this paper it has been shown that there are a number of ways to implement configuration registers in an FPGA design. We presented a global memory architecture and a distributed memory architecture; for completeness, the global memory architecture was presented with combinations of register stages and clock domain crossing registers. It has been shown that the distributed architecture has a much lower resource cost for ALMs and registers (as small as 25% and 20% respectively for a design using 226 32-bit configuration registers). It has further been shown that there is a disparity in the maximum operating frequency between the designs, with the distributed memory architecture achieving a higher maximum operating frequency.

Aside from the reduction in resource cost between the different architectures, the distributed memory architecture uses a common configuration bus that is independent of the number of target registers and their width. The uniformity of the configuration bus opens up the ability to implement the configuration system in a partially re-configurable FPGA design, where the configuration bus can be connected to any re-configurable design without penalty. Similarly, the architecture of the configuration bus is not liable to mis-sampling when crossing clock domains.
It is set only when the slave module reports it is safe to change the settings.
References

[1] AN 903: Accelerating timing closure: in Intel Quartus Prime Pro Edition. Accessed: 2020-19-03.
[2] Verilog parameters. http://verilog.renerta.com/mobile/source/vrg00032.htm. Accessed: 2020-19-03.
[3] VHDL generics. http://vhdl.renerta.com/mobile/source/vhd00034.htm. Accessed: 2020-19-03.
[4] Altera white paper FPGA architecture. 2006. Accessed: 2020-19-03.
[5] G. I. Alkady, N. A. El-Araby, M. B. Abdelhalim, H. H. Amer, and A. H. Madian. Dynamic fault recovery using partial reconfiguration for highly reliable FPGAs. In , pages 56–59, June 2015.
[6] A. Azad and A. Buluç. Distributed-memory algorithms for maximum cardinality matching in bipartite graphs. In , pages 32–42, May 2016.
[7] M. Bartik. Clock domain crossing - an advanced course for future digital design engineers. In , pages 1–5, June 2018.
[8] G. Cordasco, V. Scarano, and A. L. Rosenberg. Bounded-collision memory-mapping schemes for data structures with applications to parallel memories. IEEE Transactions on Parallel and Distributed Systems, 18(7):973–982, July 2007.
[9] A. Darte and Y. Robert. Communication-minimal mapping of uniform loop nests onto distributed memory architectures. In Proceedings of International Conference on Application Specific Array Processors (ASAP ’93), pages 1–14, Oct 1993.
[10] Greg Daughtry. Top 5 timing closure techniques. Accessed: 2020-19-03.
[11] X. Di, S. Fazhuang, D. Zhantao, and H. Wei. A design flow for FPGA partial dynamic reconfiguration. In , pages 119–123, Dec 2012.
[12] M. Feilen, M. Ihmig, A. Zahlheimer, and W. Stechele. Real-time signal processing on low-cost-FPGAs using dynamic partial reconfiguration. In , pages 110–113, Dec 2011.
[13] Feng Huang and J. Bacon. Operating system support for flexible coherence in distributed shared memory. In Proceedings of HICSS-29: 29th Hawaii International Conference on System Sciences, volume 1, pages 92–101, Jan 1996.
[14] Christian Fobel, Gary Gréwal, and Andrew Morton. Hardware accelerated FPGA placement. Microelectron. J., 40(11):1667–1671, November 2009.
[15] V. G. Gudise and G. K. Venayagamoorthy. FPGA placement and routing using particle swarm optimization. In IEEE Computer Society Annual Symposium on VLSI, pages 307–308, Feb 2004.
[16] Malay Haldar, Anshuman Nayak, Alok Choudhary, and Prith Banerjee. Parallel algorithms for FPGA placement. In Proceedings of the 10th Great Lakes Symposium on VLSI, GLSVLSI ’00, pages 86–94, New York, NY, USA, 2000. Association for Computing Machinery.
[17] A. Hassan, R. Ahmed, H. Mostafa, H. A. H. Fahmy, and A. Hussien. Performance evaluation of dynamic partial reconfiguration techniques for software defined radio implementation on FPGA. In , pages 183–186, Dec 2015.
[18] S. Hosny, E. Elnader, M. Gamal, A. Hussien, A. H. Khalil, and H. Mostafa. A software defined radio transceiver based on dynamic partial reconfiguration. In , pages 158–161, Nov 2018.
[19] Jong Hyuk Choi and Kyu Ho Park. Hybrid full map directory scheme for distributed shared memory multiprocessors. In Proceedings High Performance Computing on the Information Superhighway. HPC Asia ’97, pages 30–34, April 1997.
[20] R. Khraisha and J. Lee. A scalable H.264/AVC deblocking filter architecture using dynamic partial reconfiguration. In , pages 1566–1569, March 2010.
[21] F. Klein, K. Beineke, and M. Schöttner. Memory management for billions of small objects in a distributed in-memory storage. In , pages 113–122, Sep. 2014.
[22] L. I. Kontothanassis and M. L. Scott. Using memory-mapped network interfaces to improve the performance of distributed shared memory. In Proceedings. Second International Symposium on High-Performance Computer Architecture, pages 166–177, Feb 1996.
[23] C. Leong, P. Machado, V. Bexiga, J. P. Teixeira, I. C. Teixeira, J. C. Silva, P. Lousã, and J. Varela. Built-in clock domain crossing (CDC) test and diagnosis in GALS systems. In , pages 72–77, April 2010.
[24] Y. Li, B. Nelson, and M. Wirthlin. Synchronization techniques for crossing multiple clock domains in FPGA-based TMR circuits. IEEE Transactions on Nuclear Science, 57(6):3506–3514, Dec 2010.
[25] W. Lie and W. Feng-yan. Dynamic partial reconfiguration in FPGAs. In , volume 2, pages 445–448, Nov 2009.
[26] Z. Lin, D. H. P. Chau, and U. Kang. Leveraging memory mapping for fast and scalable graph computation on a PC. In , pages 95–98, Oct 2013.
[27] V. M. Lo. Temporal communication graphs: A new graph theoretic model mapping and scheduling in distributed memory systems. In The Sixth Distributed Memory Computing Conference, 1991. Proceedings, pages 248–252, April 1991.
[28] A. Matsuda and Jin Zhang. Debugging methodology and timing analysis in CDC solution. In , pages 365–368, Oct 2011.
[29] C. Mulpuri and S. Hauck. Runtime and quality tradeoffs in FPGA placement and routing. In FPGA ’01: Proceedings of the 2001 ACM/SIGDA ninth international symposium on Field programmable gate arrays, pages 29–36, Feb 2001.
[30] S. M. Najmabadi, Z. Wang, Y. Baroud, and S. Simon. Online bandwidth reduction using dynamic partial reconfiguration. In , pages 168–171, May 2016.
[31] A. Pal and M. Balakrishnan. A behavioral synthesis approach for distributed memory FPGA architectures. In , pages 517–520, Aug 2007.
[32] I. N. Preetam, P. Mazumder, T. S. Kumar, S. R. Krishna, and R. Kumawat. Design and verification of Ethernet, VME IP core using ACE and CDC. In , pages 194–198, Feb 2015.
[33] O. Ragheb and J. H. Anderson. High-level synthesis of FPGA circuits with multiple clock domains. In , pages 109–116, April 2018.
[34] J. H. Rutgers, M. J. G. Bekooij, and G. J. M. Smit. Portable memory consistency for software managed distributed memory in many-core SoC. In , pages 212–221, May 2013.
[35] A. Sadek, H. Mostafa, and A. Nassar. On the use of dynamic partial reconfiguration for multi-band/multi-standard software defined radio. In , pages 498–499, Dec 2015.
[36] K. Salah. An area efficient multi-mode memory controller based on dynamic partial reconfiguration. In , pages 328–331, Oct 2017.
[37] M. Su, Y. Chen, and X. Gao. A general method to make multi-clock system deterministic. In , pages 1480–1485, March 2010.
[38] H. Tirri and S. Mallenius. Optimizing the hard address distribution for sparse distributed memories. In Proceedings of ICNN’95 - International Conference on Neural Networks, volume 4, pages 1966–1970, Nov 1995.
[39] Wang Lie and Wu Feng-yan. Dynamic partial reconfiguration on cognitive radio platform. In , volume 4, pages 381–384, Nov 2009.
[40] Michael G. Wrighton and André M. DeHon. Hardware-assisted simulated annealing with application for fast FPGA placement. In Proceedings of the 2003 ACM/SIGDA Eleventh International Symposium on Field Programmable Gate Arrays, FPGA ’03, pages 33–42, New York, NY, USA, 2003. Association for Computing Machinery.
[41] Y. Zhang, Z. Wang, and J. Wang. Integrated radar signal processing using FPGA dynamic reconfiguration. In 2016 CIE International Conference on Radar (RADAR)