A distributed memory, local configuration technique for re-configurable logic designs

Abstract

The use and location of memory in integrated circuits is a key factor in their performance. Memory requires large physical area, access times limit overall system performance, and connectivity can result in large fan-out. Modern FPGA systems and ASICs contain an area of memory used to set the operation of the device from a series of commands issued by a host. Implementing these settings registers requires care; otherwise the implementation can produce a number of large fan-out nets that consume valuable resources and complicate the placement of timing-critical paths. This paper presents an architecture for implementing and programming these settings registers in a distributed manner across an FPGA, and shows how the presented architecture works in both clock-domain crossing and dynamic partial re-configuration applications. The design is compared to a 'global' settings register architecture. We implement the architectures using Intel FPGA Quartus Prime software targeting an Intel FPGA Cyclone V. It is shown that the distributed memory architecture has a smaller resource cost (as small as 25% of the ALMs and 20% of the registers) than the global memory architectures.


Alexander E. Beasley ∗ Unconventional Computing Laboratory, UWE, Bristol, UK

March 25, 2020


The use of memory, memory accessing, and memory mapping techniques has a large impact on system performance [38, 8, 34]. Efficient mapping techniques, reduction in communication overhead and the use of distributed memories can vastly increase the system's overall performance, particularly for intensive tasks such as loops and scalable graph operations [9, 22, 26, 6, 27]. The implementation of memories inside an embedded system comes with many research possibilities. Memory technologies are becoming denser and faster, allowing for higher density memory to be implemented close to its point of use. Despite this, memory still requires large physical space.

Distributed memory, where the memory is close to the point at which it is used, offers huge benefits, so long as the memories are kept coherent where necessary [19, 21, 13].

Integrated circuits often require memory to store user-defined settings that control the mode of operation. Examples include the sample rate or resolution of an ADC, the applied phase shift of an RF phase shifter, the gain of a variable gain amplifier, and so on.

Field Programmable Gate Arrays (FPGAs) provide a flexible platform for designers to fabricate seemingly endless weird and wonderful systems. Quite often designers wish to make parameterisable systems whose operation can be controlled by a number of settings. One way to achieve this is by use of parameters [2] (or generics in VHDL [3]) that can be set at compile time. Parameters (generics) are a very powerful tool available in hardware description languages for creating re-usable code. However, each combination of settings must be compiled separately, introducing a large amount of processing overhead and leading to a separate image file per configuration. Alternatively, designers can implement an area of the FPGA as an array of registers in which settings can be stored and propagated across the design.
These registers can be programmed by means of a connection with a host system (typically USB in a modern system). Implementing these features in a device creates demand on resources, reducing the overall resources available to be used as functional logic.

∗ Corresponding author: Alexander Beasley.

FPGA place and route stages are complicated procedures that attempt to locate resources as close together as possible to reduce routing complication and net delay [14, 40, 15, 16]. As the resources are fixed, this often results in trade-offs between the quality of the fitter result and the run-time of the fitter [29]. In addition, large fan-out nets often take priority during the 'fitting' stage of an FPGA's compilation. Synthesis tools often attempt to insert extra resources or promote high fan-out nets to the clocking nets [1, 10], reducing the available resources for timing-critical paths. This leads to more complicated designs that suffer from bottlenecks, which manifest as a reduction in the maximum operating frequency of a design.

FPGAs have a large amount of memory distributed throughout the device. This memory lends itself neatly to tasks that benefit from distributed memory close to the point of use, such as loop operations and array-intensive operations [31]. By extension, we can use the embedded memory blocks to create the sets of registers used to set up and control the FPGA. Distributing the settings across the FPGA to their point of use helps to reduce the required routing resources, limit the high fan-out nets and improve timing closure.

In this paper we explore how a typical 'global' register map expands with the number of required settings for a design and the width of each of these settings. The global register map is connected to modules in which a varying number of entries in the global register map are used, to help us model the resource requirements when fanning out the register map.
A second architecture is presented that removes the global register map and distributes the settings across the design to the points at which they are used. We discuss how these architectures deal with the common problem of clock domain crossing and the more recent problem of dynamic partial reconfiguration: the process by which a small portion of a design is changed at run-time without affecting the operation of the rest of the device.

The rest of this paper is organised as follows. Section 2 presents architectures for creating and distributing settings using a 'global' register map, and an architecture for distributing the settings across the device in a manner that is robust to multiple clock domains and dynamic partial re-configuration. Metrics for the architectures are presented in Sect. 3. Finally, conclusions are drawn in Sect. 4.

The settings registers are usually considered as an area of memory in which the stored values represent modes of operation for a design. These stored values are used throughout the design to influence operation. There are a number of ways to achieve the desired behaviour. The seemingly obvious one is simply to reference the values, stored in a global memory location, throughout the respective parts of the design, leading to routing as in Fig. 1.

Alternatively, distributing the memory map throughout the design moves the settings closer to where they are used. This reduces the complexity of the routing, but does not necessarily reduce the overall resource requirement. Keeping local copies of the register map entries close to the point at which they are used allows the designer to safely register the values into the appropriate clock domains. The additional flip-flop stages play an important part in breaking up the total routed path into smaller elements: the shorter the path, the easier it is for a design to meet timing closure. However, the additional flip-flops increase the overall resource cost for a design.
An example of such a design can be seen in Fig. 2.

Distributing the memory map across the design can be achieved without increasing the routing complexity. Designing the distributed memory map with a common bus interface for its configuration, Fig. 3, reduces the overall resource cost and significantly reduces the required routing resource.

The common bus interface has a number of benefits: reduced routing complexity, safe crossing into different clock domains, reduction in global memory resources, and connection into dynamically partially re-configurable logic space.
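As a rough illustration of why a common bus reduces routing, the sketch below counts the signals that leave the configuration logic in each architecture. It is a back-of-envelope Python model and is not taken from the paper: the function names, the 8-bit address width, and the layout of one Select and one Ready line per module are assumptions read off the signal labels in Figs. 1 and 3.

```python
def global_map_signals(n_settings: int, width: int) -> int:
    """Signals routed out of a global register map (Fig. 1 style).

    Every setting is exported as its own parallel bus, so the count
    grows with both the number of settings and their width.
    """
    return n_settings * width


def common_bus_signals(data_width: int, addr_width: int, n_modules: int) -> int:
    """Signals on a shared configuration bus (Fig. 3 style).

    Assumed layout: one shared Data bus, one shared Address bus, and a
    Select and a Ready line per module.
    """
    return data_width + addr_width + 2 * n_modules


if __name__ == "__main__":
    # 226 settings of 32 bits, the largest case reported in the results.
    print(global_map_signals(226, 32))   # 7232
    # Shared 32-bit data, assumed 8-bit address, one module.
    print(common_bus_signals(32, 8, 1))  # 42
```

Under these assumptions the bus count grows only with the number of modules, while the global map grows with every additional setting.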

It is not uncommon for a modern digital system to use multiple clocks [33], in which data are moved from one clock domain to another and memories are connected to different clock domains. Moving from one clock domain to another requires the use of safe clock domain crossing techniques, which are themselves a large research field [24, 7, 28, 32]; however, they use up yet more valuable resources. Typically, configuration data would be set in a slow clock domain and moved into much faster domains, potentially as very wide, parallel busses.

Figure 1: A global copy of the memory map is populated via the host controller; respective settings are routed to the appropriate modules on multi-bit busses. (Diagram labels: host controller landing point, memory map, Module 1, Module 2, ..., Module N; signals Data, Valid, Ready.)

Figure 2: Entries from the global memory map are copied to where they are required locally. (Diagram labels: host controller landing point, memory map, Module 1, Module 2, ..., Module N, each with local settings; signals Data, Valid, Ready.)

Figure 3: Bus connects elements of the design to a decoding module that distributes memory map information across the device. Uniform bus allows connection of partially dynamically re-configurable modules into the memory map bus. (Diagram labels: host controller landing point, memory map module decoder, Module 1, Module 2, ..., Module N; host-side signals Data, Valid, Ready; bus signals Data, Address, Select [0], [1], ..., [N], Ready.)

In addition to the increase in resources required for crossing clock domains, multi-clock systems lack determinism, which causes problems for the verification process. Rectifying the non-deterministic nature of such systems and providing verification techniques (both stand-alone and built-in) is a rich source of research [37, 23]. Additionally, frameworks for performing timing analysis and checking signal integrity in CDC applications [28, 32] have been proposed.

The architecture presented here, Fig. 3, exports a 'Ready' signal from each of the subsystems. The 'Ready' signal indicates that the logic has been moved to a safe state in which the local memory map may be written to using the configuration bus. No changes are made to the local configuration memories while the logic is operating; hence there is no danger of the registers being sampled while they are transitioning, and the clock domains are safely crossed.
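The Ready-gated write described above can be sketched as a small behavioural model. This is illustrative Python, not the author's SystemVerilog: the class names `Slave` and `Decoder` are hypothetical, and the choice to simply reject a write that arrives while Ready is low is an assumption, since the text does not say how the host reacts in that case.

```python
class Slave:
    """A module with its own local settings registers."""

    def __init__(self, n_regs: int):
        self.regs = [0] * n_regs
        self.ready = False  # True only while the logic is parked in a safe state

    def write(self, addr: int, data: int) -> bool:
        # Local registers change only while Ready is asserted (assumed: else reject).
        if not self.ready:
            return False
        self.regs[addr] = data
        return True


class Decoder:
    """Routes a host write to one slave over the shared Data/Address bus."""

    def __init__(self, slaves):
        self.slaves = slaves

    def host_write(self, select: int, addr: int, data: int) -> bool:
        return self.slaves[select].write(addr, data)


if __name__ == "__main__":
    slaves = [Slave(4), Slave(4)]
    bus = Decoder(slaves)
    print(bus.host_write(1, 2, 0xAB))  # False: slave 1 is still running
    slaves[1].ready = True             # slave 1 parks its logic
    print(bus.host_write(1, 2, 0xAB))  # True: write accepted
    print(slaves[1].regs)              # [0, 0, 171, 0]
```

Because the registers are static whenever the slave's logic is running, the model never has to represent a value changing while it is being sampled, which is the property the architecture relies on.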

Dynamic reconfiguration and Dynamic Partial Reconfiguration (DPR) are rapidly growing in popularity as they enable FPGA designs to be changed at run-time to better meet changing system demands [25, 11]. DPR is used across a number of sectors, including fault recovery [5], memory controllers [36], real-time signal processing [12], software defined radio [35, 18, 17], cognitive radio [39], bandwidth reduction [30], video filters [20], and RADAR signal processing [41], to name a few.

DPR designs contain a mix of static logic and re-configurable logic. Between the elements of the design a common interconnect is implemented, Fig. 4. The interconnect fabric contains the signals required for the configuration bus. When one or more modules in a re-configurable portion of the FPGA are changed, the configuration bus is connected into the new module along with all other data-path signals. Any settings registers inside the partially re-configured module are then set over the configuration bus.
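The swap-then-program sequence in the paragraph above might be modelled as follows. This is a hedged Python sketch rather than anything from the paper: `Module`, `ReconfigurableRegion`, `program`, and the module names `filter_a` and `filter_b` are all hypothetical, and the assumption that a newly loaded module starts with cleared registers is ours.

```python
class Module:
    """A re-configurable module exposing the common configuration interface."""

    def __init__(self, name: str, n_regs: int):
        self.name = name
        self.regs = [0] * n_regs  # assumed: settings registers start cleared


class ReconfigurableRegion:
    """A slot in the FPGA fabric reached through the fixed bus interface."""

    def __init__(self, module: Module):
        self.module = module

    def swap(self, new_module: Module) -> None:
        # Partial re-configuration: only this region's contents change.
        self.module = new_module

    def bus_write(self, addr: int, data: int) -> None:
        # The same Data/Address interface serves whichever module is loaded.
        self.module.regs[addr] = data


def program(region: ReconfigurableRegion, settings: dict) -> None:
    """Set the loaded module's registers over the configuration bus."""
    for addr, data in settings.items():
        region.bus_write(addr, data)


if __name__ == "__main__":
    region = ReconfigurableRegion(Module("filter_a", 2))
    program(region, {0: 5, 1: 9})
    region.swap(Module("filter_b", 3))  # new module arrives unprogrammed
    program(region, {0: 1, 2: 7})
    print(region.module.name, region.module.regs)  # filter_b [1, 0, 7]
```

The point of the sketch is that the bus interface stays fixed while the module behind it changes, so the host needs no knowledge of which module is currently loaded beyond its register addresses.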

Example designs of the above architectures were written using SystemVerilog (IEEE 1800) and processed using Intel FPGA Quartus Prime 19.1.0 (Build 670); the target device for compilation is a Cyclone V (5CSXFC6D6F31C8). Synthesis metrics are presented for each architecture: Adaptive Logic Modules (ALMs), registers, combinatorial Adaptive Look Up Tables (ALUTs) and maximum operating frequency. Implementations are given for a variety of memory depths and widths.

Figure 4: Partially re-configurable design showing the common programming interface in the interconnect logic between static and re-configurable logic. (Diagram labels: static logic, reconfigurable logic, common interface logic; signals Data, Address, Select, Ready.)

Figure 5: ALMs used in final fit (total ALMs less ALMs recovered from dense packing) for global memory module only. (a) Global memory has a registered output. (b) Global memory has no registered output. (Axes: memory depth [words] against ALMs in final fit.)

Figures 5 to 8 show key metrics for an implementation of a global memory system. The global memory system contains the decoding logic for writing to the memory, the memory itself, and the output stage that would be connected to the rest of the design. These figures do not include the resource consumption of slave modules where the settings would be used, or any clock domain crossing logic that may be implemented.

ALMs (Intel), similar to Configurable Logic Blocks (CLBs) (Xilinx), contain a number of resources, typically (A)LUTs, adders, multiplexers, routing logic, and registers [4]. Figure 5 shows that adding a register stage to the output of the memory significantly increases the number of ALMs needed for implementation; for instance, 128 512-bit words with a final register stage require just over 10,000 (10,292.6) more ALMs, approximately an extra 40%. Similarly, the number of dedicated logic registers (Fig. 6) rises by 65,536, an increase of approximately 100%. The number of ALUTs (Fig. 7) also increases by approximately 40%. This is to be expected, since the implementations shown in subfigures (a) of Figs. 5 to 8 have an extra register stage per bit of the memory map at the output.

Figure 6: Dedicated logic registers for global memory module only. (a) Global memory has a registered output. (b) Global memory has no registered output. (Axes: memory depth [words] against registers.)

Figure 7: Combinatorial ALUTs for global memory module only. (a) Global memory has a registered output. (b) Global memory has no registered output. (Axes: memory depth [words] against combinatorial ALUTs.)

This is an obvious drawback in terms of resource consumption. However, the accompanying benefit of the extra register stage is that the routing between the memory and the target can now be broken into shorter segments. This manifests itself as an increase in operating frequency for the design. Figure 8 shows the maximum operating frequency of the implementation that uses an extra register. When synthesising just the memory module on its own, we are unable to provide fmax figures for the case with no additional output register, because there are no valid paths (paths between two flip-flops) on which the timing analyzer (TimeQuest) can operate.

Figure 8: Maximum operating frequency (die temperature 85 °C) for global memory module only. Data only given for global memory with registered output. (Axes: memory depth [words] against Fmax [Hz].)

In Sect. 3.1 the resource consumption for the memory decode logic and the memory itself is shown. However, this is only half the story for a design that uses a global set of memory whose entries are propagated out to other areas of the design. In this section we take a global memory system with a global memory of 256 32-bit words and propagate these out to a slave module with a varying number of configuration registers. In addition, designs that use a combination of output registers on the global memory map, clock domain crossing registers (synchronisation chain length is 2 registers) and final location registers are examined.

Figure 9 shows the post-fit ALM requirements, Fig. 10 the post-fit register requirements, and Fig. 11 the post-fit ALUT requirements for each configuration of the global memory map architecture. As expected, increasing the number of target registers linearly increases the requirement of each resource. Designs with a greater number of register stages (post global map register, clock domain crossing synchronisation chain registers and destination registers) have significantly greater resource requirements than designs with fewer register stages: 10,099.1 ALMs, 38,146 registers, and 1,925 ALUTs for a design with 226 configuration registers and the maximum number of routing register stages, compared to 2,710.5 ALMs, 8,258 registers, and 1,913 ALUTs for a design with the same number of configuration registers but no register stages to break down the length of the routing. The more crowded a design becomes, the greater the impact that removing the routing registers has on the maximum speed of a path.

Figure 9: ALM consumption of global memory architecture with a single slave module using a variety of configuration registers and routing registers. (Axes: slave registers per slave against ALMs in final fit; legend: combinations of global reg, cc reg and final reg.)
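The 65,536 extra registers quoted earlier for the 128-word, 512-bit memory can be checked directly, since one output register stage costs one flip-flop per bit of the memory map. The short Python check below does that arithmetic; the function name is illustrative only.

```python
def flip_flops_per_stage(depth_words: int, width_bits: int) -> int:
    """One flip-flop per bit: the cost of registering the whole memory map once."""
    return depth_words * width_bits


if __name__ == "__main__":
    # 128 words of 512 bits, the case discussed for Figs. 5 to 7.
    print(flip_flops_per_stage(128, 512))  # 65536
```

The result matches the figure in the text, and it is consistent with the roughly 100% increase reported if the memory itself already occupies about the same number of registers.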

Figure 10: Register consumption of global memory architecture with a single slave module using a variety of configuration registers and routing registers. (Axes: slave registers per slave against registers; legend: combinations of global reg, cc reg and final reg.)

Figure 11: ALUT consumption of global memory architecture with a single slave module using a variety of configuration registers and routing registers. (Axes: slave registers per slave against combinatorial ALUTs; legend: combinations of global reg, cc reg and final reg.)

Figure 12: ALM consumption of distributed memory architecture with 1 to 4 slave module(s) and a variety of configuration registers. (Axes: slave registers per slave against ALMs in final fit.)

Figure 13: Register consumption of distributed memory architecture with 1 to 4 slave module(s) and a variety of configuration registers. (Axes: slave registers per slave against registers.)

The resources required for the distributed configuration memory architecture, shown in Figs. 12 to 14, are considerably lower than for the global memory architecture. The graphs shown here are for implementations with 1 to 4 slave modules; each implementation varies the number of configuration registers per slave. For comparison, numbers from the '1 slave' implementations can be mapped to the results given in Sect. 3.2.

The resources used for a distributed configuration memory implementation using 226 target registers per slave are 2,556.0 ALMs, 7,499 registers, and 1,887 ALUTs. That is 25% of the ALMs and 20% of the registers used in the global design with the maximum number of routing register stages: a significant cost saving. Increasing the number of slaves in the design has a linear effect on the resource cost.
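The headline percentages follow directly from the figures quoted in the text. The Python snippet below recomputes them; the dictionary and function names are illustrative, and the only inputs are the resource counts reported for 226 configuration registers.

```python
# Figures quoted in the text for 226 configuration registers.
GLOBAL_MAX_STAGES = {"ALMs": 10099.1, "registers": 38146, "ALUTs": 1925}
DISTRIBUTED = {"ALMs": 2556.0, "registers": 7499, "ALUTs": 1887}


def percent_of_global(resource: str) -> float:
    """Distributed cost as a percentage of the global design's cost."""
    return 100.0 * DISTRIBUTED[resource] / GLOBAL_MAX_STAGES[resource]


if __name__ == "__main__":
    for name in GLOBAL_MAX_STAGES:
        print(f"{name}: {percent_of_global(name):.1f}%")
    # ALMs: 25.3%
    # registers: 19.7%
    # ALUTs: 98.0%
```

The ALM and register ratios round to the 25% and 20% stated in the text, while the ALUT count is almost the same in both designs.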

Figure 14: ALUT consumption of distributed memory architecture with 1 to 4 slave module(s) and a variety of configuration registers. (Axes: slave registers per slave against combinatorial ALUTs.)

Figure 15: fmax for configuration memory architectures. (a) Global memory architecture. (b) Distributed memory architecture. (Axes: slave registers per slave against Fmax [Hz]; legend in (a): combinations of global reg, cc reg and final reg.)

Figure 15 shows that the maximum operating frequency of a design is also influenced by the topology of the configuration architecture. A global memory architecture achieves a maximum fmax of just shy of 140 MHz, compared to approximately 210 MHz for the distributed memory architecture.

In this paper it has been shown that there are a number of ways to implement configuration registers in an FPGA design. We presented a global memory architecture and a distributed memory architecture; for completeness, the global memory architecture was presented with combinations of register stages and clock domain crossing registers. It has been shown that the distributed architecture has a much lower resource cost for ALMs and registers (as small as 25% and 20% respectively for a design using 226 32-bit configuration registers). It has further been shown that there is a disparity in the maximum operating frequency between the designs, with the distributed memory architecture achieving a higher maximum operating frequency.

Aside from the reduction in resource cost, the distributed memory architecture uses a common configuration bus that is independent of the number of target registers and their width. The uniformity of the configuration bus opens up the ability to implement the configuration system in a partially re-configurable FPGA design, where the configuration bus can be connected to any re-configurable design without penalty. Similarly, the configuration bus is not liable to mis-sampling when crossing clock domains: the settings are changed only when the slave module reports that it is safe to do so.

References

[1] AN 903: Accelerating timing closure: in Intel Quartus Prime Pro Edition. Accessed: 2020-19-03.
[2] Verilog parameters. http://verilog.renerta.com/mobile/source/vrg00032.htm . Accessed: 2020-19-03.
[3] VHDL generics. http://vhdl.renerta.com/mobile/source/vhd00034.htm . Accessed: 2020-19-03.
[4] Altera white paper FPGA architecture, 2006. Accessed: 2020-19-03.
[5] G. I. Alkady, N. A. El-Araby, M. B. Abdelhalim, H. H. Amer, and A. H. Madian. Dynamic fault recovery using partial reconfiguration for highly reliable fpgas. In , pages 56–59, June 2015.
[6] A. Azad and A. Buluç. Distributed-memory algorithms for maximum cardinality matching in bipartite graphs. In , pages 32–42, May 2016.
[7] M. Bartik. Clock domain crossing - an advanced course for future digital design engineers. In , pages 1–5, June 2018.
[8] G. Cordasco, V. Scarano, and A. L. Rosenberg. Bounded-collision memory-mapping schemes for data structures with applications to parallel memories. IEEE Transactions on Parallel and Distributed Systems, 18(7):973–982, July 2007.
[9] A. Darte and Y. Robert. Communication-minimal mapping of uniform loop nests onto distributed memory architectures. In Proceedings of International Conference on Application Specific Array Processors (ASAP '93), pages 1–14, Oct 1993.
[10] Greg Daughtry. Top 5 timing closure techniques. Accessed: 2020-19-03.
[11] X. Di, S. Fazhuang, D. Zhantao, and H. Wei. A design flow for fpga partial dynamic reconfiguration. In , pages 119–123, Dec 2012.
[12] M. Feilen, M. Ihmig, A. Zahlheimer, and W. Stechele. Real-time signal processing on low-cost-fpgas using dynamic partial reconfiguration. In , pages 110–113, Dec 2011.
[13] Feng Huang and J. Bacon. Operating system support for flexible coherence in distributed shared memory. In Proceedings of HICSS-29: 29th Hawaii International Conference on System Sciences, volume 1, pages 92–101 vol.1, Jan 1996.
[14] Christian Fobel, Gary Gréwal, and Andrew Morton. Hardware accelerated fpga placement. Microelectron. J., 40(11):1667–1671, November 2009.
[15] V. G. Gudise and G. K. Venayagamoorthy. Fpga placement and routing using particle swarm optimization. In IEEE Computer Society Annual Symposium on VLSI, pages 307–308, Feb 2004.
[16] Malay Haldar, Anshuman Nayak, Alok Choudhary, and Prith Banerjee. Parallel algorithms for fpga placement. In Proceedings of the 10th Great Lakes Symposium on VLSI, GLSVLSI '00, page 86–94, New York, NY, USA, 2000. Association for Computing Machinery.
[17] A. Hassan, R. Ahmed, H. Mostafa, H. A. H. Fahmy, and A. Hussien. Performance evaluation of dynamic partial reconfiguration techniques for software defined radio implementation on fpga. In , pages 183–186, Dec 2015.
[18] S. Hosny, E. Elnader, M. Gamal, A. Hussien, A. H. Khalil, and H. Mostafa. A software defined radio transceiver based on dynamic partial reconfiguration. In , pages 158–161, Nov 2018.
[19] Jong Hyuk Choi and Kyu Ho Park. Hybrid full map directory scheme for distributed shared memory multiprocessors. In Proceedings High Performance Computing on the Information Superhighway. HPC Asia '97, pages 30–34, April 1997.
[20] R. Khraisha and J. Lee. A scalable h.264/avc deblocking filter architecture using dynamic partial reconfiguration. In , pages 1566–1569, March 2010.
[21] F. Klein, K. Beineke, and M. Schöttner. Memory management for billions of small objects in a distributed in-memory storage. In , pages 113–122, Sep. 2014.
[22] L. I. Kontothanassis and M. L. Scott. Using memory-mapped network interfaces to improve the performance of distributed shared memory. In Proceedings. Second International Symposium on High-Performance Computer Architecture, pages 166–177, Feb 1996.
[23] C. Leong, P. Machado, V. Bexiga, J. P. Teixeira, I. C. Teixeira, J. C. Silva, P. Lousã, and J. Varela. Built-in clock domain crossing (cdc) test and diagnosis in gals systems. In , pages 72–77, April 2010.
[24] Y. Li, B. Nelson, and M. Wirthlin. Synchronization techniques for crossing multiple clock domains in fpga-based tmr circuits. IEEE Transactions on Nuclear Science, 57(6):3506–3514, Dec 2010.
[25] W. Lie and W. Feng-yan. Dynamic partial reconfiguration in fpgas. In , volume 2, pages 445–448, Nov 2009.
[26] Z. Lin, D. H. P. Chau, and U. Kang. Leveraging memory mapping for fast and scalable graph computation on a pc. In , pages 95–98, Oct 2013.
[27] V. M. Lo. Temporal communication graphs: A new graph theoretic model mapping and scheduling in distributed memory systems. In The Sixth Distributed Memory Computing Conference, 1991. Proceedings, pages 248–252, April 1991.
[28] A. Matsuda and Jin Zhang. Debugging methodology and timing analysis in cdc solution. In , pages 365–368, Oct 2011.
[29] C. Mulpuri and S. Hauck. Runtime and quality tradeoffs in fpga placement and routing. In FPGA '01: Proceedings of the 2001 ACM/SIGDA ninth international symposium on Field programmable gate arrays, pages 29–36, Feb 2001.
[30] S. M. Najmabadi, Z. Wang, Y. Baroud, and S. Simon. Online bandwidth reduction using dynamic partial reconfiguration. In , pages 168–171, May 2016.
[31] A. Pal and M. Balakrishnan. A behavioral synthesis approach for distributed memory fpga architectures. In , pages 517–520, Aug 2007.
[32] I. N. Preetam, P. Mazumder, T. S. Kumar, S. R. Krishna, and R. Kumawat. Design and verification of ethernet, vme ip core using ace and cdc. In , pages 194–198, Feb 2015.
[33] O. Ragheb and J. H. Anderson. High-level synthesis of fpga circuits with multiple clock domains. In , pages 109–116, April 2018.
[34] J. H. Rutgers, M. J. G. Bekooij, and G. J. M. Smit. Portable memory consistency for software managed distributed memory in many-core soc. In , pages 212–221, May 2013.
[35] A. Sadek, H. Mostafa, and A. Nassar. On the use of dynamic partial reconfiguration for multi-band/multi-standard software defined radio. In , pages 498–499, Dec 2015.
[36] K. Salah. An area efficient multi-mode memory controller based on dynamic partial reconfiguration. In , pages 328–331, Oct 2017.
[37] M. Su, Y. Chen, and X. Gao. A general method to make multi-clock system deterministic. In , pages 1480–1485, March 2010.
[38] H. Tirri and S. Mallenius. Optimizing the hard address distribution for sparse distributed memories. In Proceedings of ICNN'95 - International Conference on Neural Networks, volume 4, pages 1966–1970 vol.4, Nov 1995.
[39] Wang Lie and Wu Feng-yan. Dynamic partial reconfiguration on cognitive radio platform. In , volume 4, pages 381–384, Nov 2009.
[40] Michael G. Wrighton and André M. DeHon. Hardware-assisted simulated annealing with application for fast fpga placement. In Proceedings of the 2003 ACM/SIGDA Eleventh International Symposium on Field Programmable Gate Arrays, FPGA '03, page 33–42, New York, NY, USA, 2003. Association for Computing Machinery.
[41] Y. Zhang, Z. Wang, and J. Wang. Integrated radar signal processing using fpga dynamic reconfiguration. In 2016 CIE International Conference on Radar (RADAR)