HIPE-MAGIC: A Technology-Aware Synthesis and Mapping Flow for HIghly Parallel Execution of Memristor-Aided LoGIC
Arash Fayyazi∗, University of Southern California, Los Angeles, [email protected]
Amirhossein Esmaili∗, University of Southern California, Los Angeles, [email protected]
Massoud Pedram, University of Southern California, Los Angeles, [email protected]
∗Both authors contributed equally to this research.
ABSTRACT
Recent efforts for finding novel computing paradigms that meet today's design requirements have given rise to a new trend of processing-in-memory relying on non-volatile memories. In this paper, we present HIPE-MAGIC, a technology-aware synthesis and mapping flow for highly parallel execution of memristor-based logic. Our framework is built upon two fundamental contributions: balancing techniques during logic synthesis, mainly targeting the benefits of the parallelism offered by memristive crossbar arrays (MCAs), and an efficient technology mapping framework to maximize the performance and area efficiency of the memristor-based logic. Our experimental evaluations across several benchmark suites demonstrate the superior performance of HIPE-MAGIC in terms of throughput and energy efficiency compared to recently developed synthesis and mapping flows targeting MCAs, as well as to conventional CPU computing.
1 INTRODUCTION
In a classical von Neumann architecture, the processor and memory are separate entities, and thus data transfers between them are needed to perform a computational task. These data transfers lead to performance degradation and energy consumption [6, 17, 18]. The situation is exacerbated by the limited cache capacity and on-chip bandwidth [1], which give rise to long idle time intervals due to memory synchronization during a computation [22]. The processing-in-memory (PIM) approach attempts to address this issue by breaking the von Neumann separation and eliminating the need to transfer data between the processor and memory, since the data already resides in the memory [7]. The emergence of new non-volatile memory technologies, such as memristors, which can be employed to perform logic operations, has provided new avenues for employing PIM. Memristors offer relatively high endurance, switching speeds of less than 10 ns, and data retention of about 7 years [13].
One approach for performing logic computations within a conventional memristive crossbar array (MCA) is to use a stateful logic design style. In stateful logic, logic states are represented by the variable resistance values of memristors. Typically, the high-resistance state of a memristor is used for representing logic 0, whereas the low-resistance state is used for representing logic 1. IMPLY [3] and MAGIC [19] are two examples of widely used stateful logic design styles.
There has been a body of research on developing automatic frameworks for the efficient implementation of arbitrary logic functions within an MCA based on the IMPLY [5, 9, 11] or MAGIC design style [2, 10, 20, 21]. The MAGIC-based implementation requires memristors only, whereas the IMPLY-based implementation needs both memristors and resistors to realize the material implication function.
Furthermore, IMPLY-based solutions rely on custom data structures that are not typically part of a standard synthesis flow, while MAGIC-based implementations use the conventional logic NOR gate. In addition, the NOR gates in the MAGIC design style can be implemented in the crossbar along both rows and columns. This allows several NOR gates to operate in parallel, which in turn tends to decrease the computation latency.
The traditional flow for implementing logic functions within an MCA consists of two phases: logic optimization and technology mapping. Previous work typically employs a standard synthesis tool, such as the ABC tool [4], to obtain a gate-level netlist, and focuses more on the second phase, which assigns suitable memristors on the MCA to logic gates in order to minimize delay and/or area while satisfying the required spatial alignment constraints for memristors within the MCA. This approach does not result in an efficient logic implementation on the MCA because it does not consider the advantages that the underlying 2D grid of the MCA can offer in terms of parallel computation of logic operations.
This paper presents HIPE-MAGIC, a synthesis and mapping flow that exploits the native characteristics of MCAs. All in-memory computations in this work rely on the MAGIC style of computing. The main contributions of this work are four-fold:
• A look-up table (LUT)-based synthesis flow (where each LUT groups several MAGIC NOR gates), accompanied by balancing operations that provide more opportunities for parallelization of computations in the underlying MCA by reducing the logic depth of the target circuit to be mapped to the MCA.
• A heuristic mapping strategy that helps improve both the area efficiency and the computation latency of the MCA realizing an LUT-based logic network.
• Improving state-of-the-art synthesis and mapping flows targeting MCAs. More precisely, HIPE-MAGIC improves both the average computation latency (by a factor of 2.1) and area (by a factor of 1.4) across the ISCAS'85 and IWLS'93 combinational circuit benchmark suites compared to prior work.
• A detailed analysis of the strengths and weaknesses of our PIM solution compared to conventional CPU computing (as an example of a von Neumann architecture), based on a parameterized analytical model. The analysis demonstrates that the PIM-based solutions generated by HIPE-MAGIC have superior performance in terms of throughput and energy efficiency compared to those generated by recently developed synthesis and mapping flows, as well as to conventional CPU computing.
While the scope of this work is a synthesis and mapping flow in the PIM paradigm rather than a comparison of the PIM approach with traditional CPU computing, we also present an analytical comparison between the two in Section 4.2 in terms of throughput and energy efficiency, based on a parameterized analytical model.
The remainder of the paper is organized as follows. Sec. 2 describes central preliminary concepts used in this work. Details on the newly proposed synthesis and mapping frameworks of HIPE-MAGIC are presented in Sec. 3, while Sec. 4 studies the effectiveness of HIPE-MAGIC. Conclusions are drawn in Sec. 5.

2 PRELIMINARIES
All in-memory computations in this work rely on the basic MAGIC NOR operation. We first describe the two phases of logic synthesis, the technology-independent and technology-dependent phases. Then, we explain the MAGIC style of operation and supergate-aided synthesis and mapping techniques, the latter of which drives our choice for the proposed flow in this work. Finally, we explain how the MCA motivates our choice to optimize both phases of logic synthesis.
Logic Synthesis: Logic synthesis is divided into technology-independent and technology-dependent phases. In the first phase, algebraic transformations are performed to reduce the number of literals in the optimal factored form of a given Boolean expression and consequently reduce its area and delay. These algebraic optimizations are interleaved with node simplification operations that utilize controllability and observability don't-cares to achieve an even greater reduction in the area or delay of the synthesized network. Next, in preparation for the technology-dependent phase, the optimized Boolean network is transformed into a common semantic domain, e.g., an And-Inverter Graph (AIG). Note that the library cells are also represented as pattern graphs in the same domain. This step is called technology decomposition, and such a network is called a subject graph. Finally, in the technology-dependent phase, subgraphs of the subject graph are mapped to suitable pattern graphs (e.g., gates from library cells, or LUTs for FPGAs) in order to minimize the target cost function, e.g., delay or area. This is a graph covering problem.
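As a small illustration of technology decomposition, a function such as F = ab + c can be rewritten using only 2-input AND nodes and inverted edges (via De Morgan's law), which is precisely the AIG form. The sketch below is a hypothetical, minimal behavioral model, not the actual data structure used by ABC:

```python
from itertools import product

# Technology decomposition sketch: express F = (a AND b) OR c using only
# 2-input AND nodes plus inversions, i.e., an And-Inverter Graph (AIG).
# By De Morgan: a*b + c == NOT( NOT(a*b) AND NOT(c) ).

def aig_eval(a: int, b: int, c: int) -> int:
    n1 = a & b                 # AND node
    n2 = (1 - n1) & (1 - c)    # AND node with both fan-in edges inverted
    return 1 - n2              # inverted output edge

# The subject graph is functionally equivalent to the original expression.
for a, b, c in product([0, 1], repeat=3):
    assert aig_eval(a, b, c) == ((a & b) | c)
```

In a real AIG the inversions are attributes on edges rather than separate gates, which is what makes the representation compact and uniform for the mapper.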
MAGIC Style of Operation: The MAGIC NOR function is performed by applying a single operating voltage (V_G) to the input memristor(s) to initialize the output memristor to logic 0 (high resistance value). The state of the output memristor (which is also considered a memory cell) changes pursuant to the logical states of the input memristor(s). The NOT function is realized as a single-input NOR function. The spatial requirement for a MAGIC gate is that its input(s) and output must be located in the same row or in the same column of the memristor array. Furthermore, to execute multiple row-wise (or column-wise) MAGIC gates in parallel, their input and output memristors should be aligned along the same set of columns (rows). Fig. 1 depicts crossbar configurations for the in-memory execution of two row-wise and two column-wise MAGIC NOR gates in parallel. This figure shows the advantages of the MAGIC operation, such as the separation between the input(s) and output memristors, the need for only a single V_G, the absence of additional peripheral elements, and the ability to perform multiple operations in parallel [10]. Since NOR is a functionally complete logic operation, a MAGIC NOR gate is sufficient for the execution of any logic operation within the MCA.

Figure 1: Parallel execution of two aligned MAGIC NOR gates in (a) row-wise and (b) column-wise manners.
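The logical behavior of a MAGIC NOR gate, and the alignment condition for firing several row-wise gates in one cycle, can be captured by a small behavioral sketch. All names below are hypothetical, and the device physics is abstracted away:

```python
def magic_nor(*inputs: int) -> int:
    """Behavioral model of a MAGIC NOR gate: after initialization of the
    output memristor, it settles in the logic-1 (low-resistance) state
    exactly when every input memristor holds logic 0."""
    return int(not any(inputs))

def magic_not(x: int) -> int:
    # NOT is realized as a single-input NOR.
    return magic_nor(x)

def parallel_ok(gates) -> bool:
    """Row-wise MAGIC gates can fire in the same cycle only if their input
    and output memristors occupy the same set of columns (rows differ).
    Each gate is a list of (row, column) memristor coordinates."""
    cols = [tuple(col for _, col in g) for g in gates]
    return all(c == cols[0] for c in cols)

# NOR truth table and NOT as single-input NOR.
assert magic_nor(0, 0) == 1 and magic_nor(0, 1) == 0 and magic_nor(1, 1) == 0
assert magic_not(0) == 1 and magic_not(1) == 0

# Two gates on rows 0 and 1, inputs in columns 0-1 and outputs in column 2,
# as in the row-wise case of Fig. 1: they are aligned and can run in parallel.
g0 = [(0, 0), (0, 1), (0, 2)]
g1 = [(1, 0), (1, 1), (1, 2)]
assert parallel_ok([g0, g1])
```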
Supergate-Aided Synthesis and Mapping: Tenace et al. [20] proposed using k-input LUTs for mapping the subgraphs of an AIG subject graph to the MCA. In their approach, these LUTs are subsequently transformed into NOR-of-NORs (NoN) supergates and implemented on the MCA. Any single-output Boolean function F: B^n → B represented by a Sum of Products (SOP) associated with a single LUT can be translated to a NoN following these three steps: 1) negating all primary inputs, 2) replacing all AND and OR operations (∧ and ∨) with NOR operations (↓), and 3) negating the result. The crossbar mapping scheme proposed in [20] for these supergates is a tile-based mapping, where each tile contains a supergate and is identified by two indices l and y representing the column and row coordinates of that tile, adhering to some spatial rules. All supergates from one logic level are placed in one column and sorted by their size (i.e., the number of their LUT terms). The mapping scheme in [20] also imposes the constraint that a supergate in the (l, y) position must be smaller in size than the supergate placed in the (l − 1, y) tile.
An example of the tile-based mapping for a single-bit full adder is depicted in Fig. 2 (a). An output of a supergate can be generated as a sequence of three MAGIC functions, as suggested in Fig. 2 (b). To cascade supergates, alignment copy operations are performed to satisfy the spatial constraints of MAGIC gates. Depending on the polarity of the input/output signals, the data alignment copy operations are either NOT copies or (NOT)² copies (see Fig. 2 (c)), where a (NOT)² copy consists of two consecutive NOT copies and uses one extra auxiliary memristor for the intermediate signal.
A major challenge of prior work, including [20] and [21], is the large computational latency due to alignment copy operations. For instance, the main logic operations in Fig. 2 take only five cycles, while the alignment copy operations take 12 cycles.
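The three-step SOP-to-NoN translation described above can be checked mechanically. In the sketch below (a hypothetical encoding, not the representation used in [20]), an SOP is a list of product terms, each term a list of (variable, polarity) literals. With all primary inputs negated, the inner NORs realize the products (a NOR of complemented literals is an AND), the outer NOR realizes the complemented sum, and the final negation recovers F:

```python
from itertools import product as cartesian

def eval_sop(sop, assign):
    # F = OR over products of (AND over literals); pol=1 is a positive literal.
    return int(any(all((assign[v] if pol else 1 - assign[v]) == 1
                       for v, pol in term) for term in sop))

def nor(*xs):
    return int(not any(xs))

def eval_non(sop, assign):
    # Step 1: negate all primary inputs; Step 2: replace AND/OR with NOR;
    # Step 3: negate the result.
    neg = {v: 1 - x for v, x in assign.items()}
    inner = [nor(*((neg[v] if pol else 1 - neg[v]) for v, pol in term))
             for term in sop]
    return 1 - nor(*inner)

# Sum output of a single-bit full adder: S = a'b'c + a'bc' + ab'c' + abc.
S = [[('a', 0), ('b', 0), ('c', 1)], [('a', 0), ('b', 1), ('c', 0)],
     [('a', 1), ('b', 0), ('c', 0)], [('a', 1), ('b', 1), ('c', 1)]]
for a, b, c in cartesian([0, 1], repeat=3):
    asg = {'a': a, 'b': b, 'c': c}
    assert eval_sop(S, asg) == eval_non(S, asg) == (a ^ b ^ c)
```

The three nested levels of NOR in `eval_non` correspond directly to the three MAGIC operations of Fig. 2 (b).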
HIPE-MAGIC tackles this challenge by devising a mapping scheme that further reduces the number of operations by overlapping input/output signals and by sharing alignment copy operations through heuristic placement.
Balancing Operations: Previous work on optimizing synthesis flows for CMOS illustrates the effect of the balancing operation, which reduces the logic depth of a combinational circuit and thereby its delay. In [14], Mishchenko et al. proposed AND-balancing and SOP-balancing algorithms for delay optimization. AND-balancing of an AIG is a well-known fast transformation that reduces the number of AIG levels. Pasandi and Pedram [16] also presented balanced factorization and rewriting algorithms that consider an imbalance factor when calculating the value of a potential factorization. Their results show the importance of the AIG balancing operation in minimizing the delay.

Figure 2: Implemented 1-bit full adder based on [20]: (a) tile-based mapping, (b) sequence of operations within a supergate, and (c) MCA mapping. 17 cycles of PIM operations and 30 memristors are needed. A NOT copy takes one cycle, whereas a (NOT)² copy takes two cycles of PIM operations.

To perform SOP balancing, a small AIG (e.g., one that depends on, say, 10 or fewer inputs) is converted into an SOP. SOP balancing then applies AND-balancing to each product and subsequently to the sum operation. Fig. 3 illustrates the SOP-balancing technique on a small AIG (inspired by [14]) and shows how it translates into a delay optimization of the logic implementation in MAGIC NORs. The total number of computational cycles needed for implementing this function on the memristive crossbar (i.e., the number of logic levels) is reduced from 5 in Fig. 3 (c) to 3 in Fig. 3 (d). This example motivates us to leverage balancing operations, which can lead to improvements for MCA-based PIM by enabling more operations to be performed in parallel.

Figure 3: An example of SOP balancing: (a) an unoptimized and (b) an optimized implementation in AIG, which can be translated to (c) an unoptimized and (d) an optimized implementation in the MAGIC style, respectively.

3 PROPOSED SYNTHESIS AND MAPPING FLOW
HIPE-MAGIC's flow is divided into two main steps:
• Logic Synthesis: The input is an arbitrary logic function, and the output is a netlist comprising an optimized, mapped LUT representation of the input function.
• MCA Mapping: The input is a netlist of synthesized LUTs, and the output is the placement of each supergate within the 2D memristive crossbar array, along with a schedule of the sequence of operations.
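The delay benefit that balancing brings to the MCA can be seen by comparing the logic depth of a chain of 2-input gates against that of a balanced tree over the same operands; on a crossbar, depth translates directly into sequential PIM cycles, with each level firing in parallel. A minimal sketch (hypothetical helper names):

```python
def chain_depth(n: int) -> int:
    # Left-deep chain of 2-input gates over n operands: depth n - 1.
    return n - 1

def balanced_depth(n: int) -> int:
    # Balanced binary tree of 2-input gates: depth ceil(log2(n)).
    d = 0
    while (1 << d) < n:
        d += 1
    return d

# An 8-input function needs 7 levels as a chain but only 3 when balanced,
# so the balanced form executes in fewer sequential cycles on the MCA.
assert chain_depth(8) == 7
assert balanced_depth(8) == 3
```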
3.1 Logic Synthesis
Logic synthesis for enabling PIM is a crucial step in the MCA design flow, with a significant impact on the total area and performance. The underlying 2D grid in MCAs offers the opportunity to concurrently perform many logic operations in each logic level, and this opportunity should be considered during the logic synthesis flow. Similar to prior work, we utilize the ABC tool [4], but with the addition of balancing operations. ABC receives an arbitrary logic function and generates a netlist of optimized LUTs while minimizing the number of gates and logic levels. HIPE-MAGIC runs balancing operations in two phases within the ABC tool:
(1) Technology-independent phase: on the AIG itself (i.e., before mapping the Boolean function to LUTs).
(2) Technology-dependent phase: on the priority cuts (i.e., during the LUT mapping process).
Technology-independent Phase: We have added the balanced factorization and rewriting algorithms of [16] as commands to ABC [4]. These commands tend to reduce the size of the AIG representing the input logic network and also introduce more parallelism into the AIG by utilizing the balancing operations. We have used the following optimization scripts: {blnc_syn2, resyn, resyn2, resyn2rs, compress2rs}, where blnc_syn2 is a set of optimization commands interleaving the balanced rewriting and factorization commands with the AND-balancing command of ABC. The other commands are heuristic methods provided by ABC to further optimize the AIG network.
Technology-dependent Phase:
In general, AND-balancing, balanced factorization, and rewriting algorithms are limited to AIGs and AND gates, while SOP balancing can look at larger functions (i.e., the priority cuts used by the k-input LUT mapper in ABC [4], where k denotes the number of LUT inputs). Therefore, applying SOP balancing during LUT mapping can reduce delay in many cases. However, it is impractical to apply SOP balancing to every AIG node, since a large design can have millions of AIG nodes. For this purpose, a large AIG is divided into parts (using cutlines), and the SOP-balancing operation is carried out for each of the cuts (for an example, see Fig. 3). The SOP-balancing operation is implemented in ABC and is invoked through the priority-cut-based mapper command if. We call our optimization script TechDepOpt; it performs the technology-dependent synthesis targeting MCAs. The sequence of ABC commands is: st; if -K k; (st; dch; if -K k)†; st; if -g -C c -K l; (st; dch; if -K k)†; st; dch; if -K k -S s. In TechDepOpt, -g enables SOP balancing for the cut evaluation, and c is the number of l-input cuts computed at each node in the subject graph. dch performs AIG-based synthesis with a repeated sequence of technology-independent optimizations on different structural choices (functionally equivalent networks obtained by running AIG rewriting scripts on the current network), and st transforms the network back to the AIG form. To ensure that the delay after mapping into k-input LUTs can be reduced, the cut size for the SOP-balancing operation (l in "if -g -C c -K l") is set to a number larger than the cut size for the final mapping command (k in "if -K k -S s"). Using a larger cut size for the SOP-balancing operation can potentially result in better delay optimization.

3.2 MCA Mapping
The method presented in [20] for mapping a netlist of supergates to the MCA tends to limit the potential performance gain that can be obtained from the parallelization offered by the supergates. It also introduces a number of redundant operations, owing to how it employs the SOP-to-NoN translation, which can be avoided. Furthermore, to make better use of the 2D crossbar grid, a high area efficiency of the mapped gates, i.e., how densely the memristors used for logic computations are packed next to each other in rows and columns, is preferable. However, the mapping rules presented in [20] may result in a sparse mapping of a netlist (e.g., see Fig. 2) even when a denser and more area-efficient mapping is possible. In addition, the overhead associated with data alignment copies can be alleviated by reusing auxiliary memristors. Based on the above observations, certain refinements and mapping operations are used, as shown in Algorithm 1 and elaborated on below.
MCA-Driven SOP-to-NoN Translation:
For each pair of cascaded LUTs, we merge the third step of the SOP-to-NoN mapping of the second LUT (cf. Section 2.1.3 above) with the first step of the SOP-to-NoN mapping of the first LUT (this is done as part of the Map_SOP_to_NoN() function in Algorithm 1). For instance, assume a Boolean network that is a graph of connected blocks (e.g., nodes, LUTs). Here, the individual component blocks are two-level Boolean functions in the SOP form, which can be represented by y_j = f_j(x, y), where y_j denotes the node's output variable and x and y denote the node's input variables (see Fig. 4 (a)). There exists an edge (i, j) if f_j depends explicitly on y_i; differently worded, there is a connection between the j-th node and the i-th node. The SOP-to-NoN translation imposes that each supergate (i.e., the node created after translation) depends on the negated values of its input nodes, g_j(¬x, ¬y), as shown in Fig. 4 (b), where y_j = ¬g_j(..., ¬y_i, ...). Therefore, y_i is calculated by performing a NOT function on ¬y_i, which is the node variable of g_i. This NOT function is equivalent to the last step in the translation of the SOP representation of f_i into the NoN representation. Eliminating the two NOT functions on an edge between two cascaded nodes (see Fig. 4 (c)) reduces the number of operations required for implementing the target Boolean network. In other words, for each supergate (except for the ones that produce the primary outputs), we discard the third operation (i.e., the NOT function in Fig. 2 (b)).

† The number of rounds of running the sequence of the commands in parentheses.

Figure 4: An example of MCA-driven SOP-to-NoN translation: (a) f_j is a logic function (represented in the SOP form), (b) original translation to NoN, where g_j is a logic function composed of only NOR functions, and (c) refined translation. Small circles denote the NOT function.

Placement of Supergates:
The spatial rules for the mapping procedure mentioned in Section 2.1.3 rely on the intuition that as the logic level increases, the supergate size decreases. However, this intuition does not always hold true. For instance, in the mapping of the one-bit full adder depicted in Fig. 2, some supergates are pushed down in their columns (logic levels) by the spatial rules even though they could be placed higher up in those columns. Placing them higher increases the area efficiency of the mapped supergates, which thus makes better use of the underlying 2D grid of memristors.

Algorithm 1: The proposed mapping scheme
Input: The synthesized netlist of LUTs; L: number of logic levels
Output: Mapping of supergates and alignment copies
// MCA-driven SOP-to-NoN mapping:
Map_SOP_to_NoN();
// Placement of supergates:
for l = 1; l ≤ L; l = l + 1 do
    Sort_Supergates(l);    // sort supergates in level l
    if (l mod 2 == 0) then
        Flip_Supergates(l);    // flip supergates in level l vertically
    end
    Place_Supergates(l);    // place supergates in level l
end
// Resource sharing for data alignment copies:
Share_Align_Memristors();

We first sort the supergates in each logic level l in decreasing order of their size (the Sort_Supergates(l) function in Algorithm 1). To take advantage of the underlying 2D grid, we alter the direction of the vertical NOR in the supergates of consecutive columns (the Flip_Supergates(l) function in Algorithm 1). Next, we place the supergates of the first logic level (l = 1) in the first column following the sorting order. For subsequent columns, we go through the sorted supergates in each level and place each supergate in the first available vertical position y in the column where the output of its driving supergate in level l − 1 and the same y position is located; the supergate in level l is thereby aligned with the one in level l − 1 in its first row. In this way, one data alignment copy is saved for each pair of cascading supergates in the same y position. Finally, the supergates in level l that were not placed in column l as a result of the described process are placed, in decreasing order of their size, in the first available y position in column l (all of this is done in the Place_Supergates(l) function in Algorithm 1 for each column l).
Using this placement scheme, the corresponding supergates in Fig. 2 (a) are pushed up in their respective columns in Fig. 5 (a), which improves the area efficiency. Furthermore, the output memristor of a supergate is shared as one of the inputs of the supergate cascaded with it in the same y position, which saves one data alignment copy. Note that the assumed dimension of the underlying MCA in this work is 1024 × …

Resource Sharing for Data Alignment Copies:
When the output of a supergate must be aligned to one or more rows of cascaded supergates in the next logic level, two NOT copies are used for each alignment with the aid of one auxiliary memristor. However, by sharing the auxiliary memristor among the alignment copies that must be carried out for the output of that supergate, and performing the second NOT of each alignment from that shared memristor (the Share_Align_Memristors() function in Algorithm 1), we reduce both the latency and the memristor count. Fig. 5 shows an example of the advantage of resource sharing for data alignment copies: one of the input memristors of a supergate is reused for transferring the output of another supergate to a third one.

Figure 5: Implemented 1-bit full adder using HIPE-MAGIC: (a) tile-based mapping, (b) execution orders, and (c) MCA mapping. 10 cycles of PIM operations and 22 memristors are needed.

4 EXPERIMENTAL RESULTS
We used the ISCAS'85 and IWLS'93 benchmark suites for our performance evaluation. A Python script runs ABC, applying the optimization scripts during synthesis and varying the value of k (k = 2, 3, 7, 10). Next, technology mapping is done according to the mapping scheme presented in Section 3.2. Lastly, the design space for each benchmark (associated with the different values of k) is explored through a Pareto analysis, and the best solution for each benchmark in terms of the area-latency trade-off is selected as the solution associated with HIPE-MAGIC. Note that the real-time control of the MCA implementing the logic function (i.e., a sequencer that activates rows and columns of the MCA) is not considered. It is also worth mentioning that the electrical conditions of the MCA, such as sneak-path currents, are out of the scope of this paper.

Table 1: Comparison between the mapping metrics of HIPE-MAGIC with those of [20] and [21] on the ISCAS'85 and IWLS'93 benchmarks.
                       HIPE-MAGIC        Tenace et al. [20]                        Thangkhiew and Datta [21]
Suite      Circuit     Cycles   Mems     Cycles   Mems     Speedup  Area-Saving    Cycles   Mems     Speedup  Area-Saving
ISCAS'85   c432        122      366      156      631      1.28     1.72           265      372      2.17     1.02
           c499        253      836      420      1399     1.66     1.67           935      1085     3.70     1.30
           c880        219      862      482      1113     2.20     1.29           750      886      3.42     1.03
           c1355       253      836      554      1182     2.19     1.41           938      1076     3.71     1.29
           c1908       313      809      627      1095     2.00     1.35           970      1061     3.10     1.31
           c2670       332      1462     643      1249     1.94     0.85           1401     1915     4.22     1.31
           c3540       758      2544     1566     3261     2.06     1.29           2418     2663     3.19     1.05
           c5315       …        …        …        …        …        …              …        …        …        …
           c6288       …        …        …        …        …        …              …        …        …        …
           c7552       …        …        …        …        …        …              …        …        …        …
           Average     723.2    1991.9   1283.6   2293.7   1.84     1.27           1974.7   2261.2   3.12     1.17
IWLS'93    …           99       154      160      1026     1.61     1.78           552      594      5.57     1.03
           apex5       371      1910     777      2223     2.09     1.16           1966     2319     5.30     1.21
           clip        54       310      135      451      2.50     1.45           239      261      4.42     0.84
           duke2       263      1037     300      1632     1.14     1.57           1261     1342     4.80     1.29
           inc         21       296      55       280      2.62     0.95           264      282      12.57    0.95
           misex3c     183      1585     518      2551     2.83     1.61           2976     3094     16.26    1.95
           rd73        27       232      150      379      5.55     1.63           262      280      9.70     1.21
           sao2        54       344      79       559      1.46     1.62           309      331      5.72     0.96
           vg2         27       172      55       280      2.04     1.63           289      345      10.70    2.01
           Average     122.11   671.11   247.67   1042.33  2.43     2.03           902      983.11   8.34     1.59
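The Speedup and Area-Saving columns of Table 1 are simply the ratios of a baseline's cycle and memristor counts to those of HIPE-MAGIC; the check below reproduces the c432 row against [20]:

```python
# Speedup and area-saving are ratios of the baseline's metrics to
# HIPE-MAGIC's (numbers taken from the c432 row of Table 1).
hipe = {"cycles": 122, "mems": 366}
tenace = {"cycles": 156, "mems": 631}   # Tenace et al. [20]

speedup = tenace["cycles"] / hipe["cycles"]
area_saving = tenace["mems"] / hipe["mems"]
assert round(speedup, 2) == 1.28
assert round(area_saving, 2) == 1.72
```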
4.1 Comparison with Prior Synthesis and Mapping Flows
We evaluate the proposed approach by calculating the number of compute cycles and the number of memristors required for each benchmark using the proposed synthesis and mapping flow. The area and latency improvements of HIPE-MAGIC with respect to some of the state-of-the-art prior work are reported in Table 1. We compare HIPE-MAGIC with the mapping approach presented in [20] and with the work in [21], which presents a mapping procedure for a netlist consisting of MAGIC NOR/NOT gates based on a simulated annealing algorithm that minimizes the number of computational cycles. For each MCA-based implementation, we compare the number of cycles required to generate all primary outputs (Cycles in Table 1) and the number of memristors required to implement the synthesized netlist (Mems in Table 1).
The results confirm that the proposed optimization scripts leveraging the balancing operations reduce the latency by offering more opportunities for executing multiple supergates of a logic level concurrently, while simultaneously reducing the number of logic levels. However, the proposed scripts may add some redundancy in terms of logic and/or supergates to keep the logic levels balanced, which may introduce area overheads. A particular mention is required for c2670 and c5315, where HIPE-MAGIC has a small area overhead compared to [20]. It is worth mentioning that a favorable trade-off between speed and area can be obtained in HIPE-MAGIC by setting a different value of k. In addition, the proposed dedicated mapping flow compensates for the potential area overhead due to the redundancy introduced by supergates, and also reduces the latency, mainly by sharing resources for data alignment copies. In particular, HIPE-MAGIC outperforms [20] in terms of latency by an average improvement of 2.11x across both benchmark suites, while using 1.37x fewer memristors. Compared to the meta-heuristic mapping solution presented in [21], HIPE-MAGIC is 3.12x (8.34x) faster and requires 1.17x (1.59x) fewer memristors for the ISCAS'85 (IWLS'93) benchmarks.
4.2 Comparison with Conventional CPU Computing
We analyze and quantify the strengths and weaknesses of our PIM realization relative to a conventional CPU and to the PIM solutions generated by [20] and [21]. For our comparison, we employ the parameterized analytical modeling tool presented in [8] (referred to as Bitlet), which evaluates the affinity of workloads to PIM versus CPU computing according to a set of PIM- and CPU-related parameters. PIM computations in the Bitlet model are limited mainly by operation complexity and data alignment costs, i.e., the number of PIM cycles in Table 1. For CPU computing, on the other hand, Bitlet ignores the cost associated with computations and data movements done within the CPU itself (which is indeed in favor of the CPU in our comparison), and the CPU throughput is thus mainly limited by its external memory bandwidth utilization, i.e., the cost associated with the number of data bits transferred between the CPU and memory [8]. Bitlet uses technology parameters for a 28 nm node from [15] for the memory model of the CPU and bases its memristor model on [12].
Using the Bitlet model, Fig. 6 (a) shows a comparison in terms of achieved throughput between the PIM solutions (including HIPE-MAGIC) and CPU computing. Notice that PIM does not achieve any throughput improvement on c6288, because this circuit is an array multiplier with a relatively large number of PIM cycles. Furthermore, the comparison between PIM and CPU on c2670, which has a relatively large number of input/output lines (which affects the CPU throughput adversely), demonstrates that PIM can greatly alleviate the overheads associated with data transfers from/to the main memory.

Figure 6: Throughput and energy consumption comparisons between CPU and PIM approaches.

Comparing the energy obtained using the Bitlet model, the CPU energy consumption is higher than that of the PIM solutions over all the studied benchmarks (see Fig. 6 (b)). For the CPU, the energy consumption is determined by the number of transferred bits and the energy per bit transfer, while the operation complexity and the PIM energy per operation determine the energy consumption of the PIM solutions. As the number of PIM cycles increases, the relative energy efficiency of PIM decreases (e.g., c6288).
These results demonstrate once again that PIM-based solutions are beneficial when much of the latency and energy consumption associated with a target application is due to data transfers between the memory and the computing fabric. There are many such interesting applications, including, for example, big data analysis and neural network based inference.
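To give a feel for the kind of comparison the Bitlet model [8] performs, the toy sketch below contrasts a PIM throughput bound (concurrent crossbar rows divided by the PIM cycle count and cycle time) with a CPU bound limited purely by external memory bandwidth. All parameter values are illustrative placeholders, not the calibrated numbers used by Bitlet:

```python
def pim_throughput(rows: int, cycles: int, cycle_time_ns: float) -> float:
    """Operations per second when every crossbar row computes one instance
    of the function concurrently, serialized over `cycles` PIM cycles."""
    return rows / (cycles * cycle_time_ns * 1e-9)

def cpu_throughput(bandwidth_bytes_per_s: float, bits_per_op: int) -> float:
    """CPU bound that ignores compute cost and is limited only by the
    number of data bits moved between the CPU and memory per operation."""
    return bandwidth_bytes_per_s * 8 / bits_per_op

# Illustrative placeholders: 1024 concurrent rows, 122 PIM cycles (the c432
# entry of Table 1), 10 ns per cycle; 10 GB/s bandwidth, 68 I/O bits per op.
pim = pim_throughput(1024, 122, 10.0)
cpu = cpu_throughput(10e9, 68)
assert pim > 0 and cpu > 0  # a larger pim/cpu ratio favors the PIM mapping
```

The sketch makes the two opposing trends visible: more PIM cycles (e.g., c6288) depress the PIM bound, while more I/O bits per operation (e.g., c2670) depress the CPU bound.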
5 CONCLUSION
In this paper, we presented a technology-aware synthesis approach along with a heuristic mapping framework. The results show that HIPE-MAGIC achieves promising results compared to state-of-the-art work and to conventional CPU computing. As data alignment copy operations contribute a significant number of cycles when implementing a logic function, restricting the number of fan-outs of LUT nodes during logic synthesis is a promising direction for future work.
REFERENCES
[1] Rajeev Balasubramonian et al. 2014. Near-data processing: Insights from a MICRO-46 workshop. IEEE Micro (2014).
[2] R. Ben-Hur et al. 2019. SIMPLER MAGIC: Synthesis and Mapping of In-Memory Logic Executed in a Single Row to Improve Throughput. IEEE Trans. CAD Integr. Circuits Syst. (2019). https://doi.org/10.1109/TCAD.2019.2931188
[3] Julien Borghetti et al. 2010. 'Memristive' switches enable 'stateful' logic operations via material implication. Nature (2010).
[4] Robert Brayton and Alan Mishchenko. 2010. ABC: An Academic Industrial-Strength Verification Tool. In Computer Aided Verification. Springer.
[5] S. Chakraborti et al. 2014. BDD based synthesis of Boolean functions using memristors. In IDT. https://doi.org/10.1109/IDT.2014.7038601
[6] Daichi Fujiki, Scott Mahlke, and Reetuparna Das. 2018. In-memory data parallel processor. In ACM SIGPLAN Notices, Vol. 53. ACM, 1–14.
[7] Mohsen Imani et al. 2019. FloatPIM: In-memory Acceleration of Deep Neural Network Training with High Precision. In ISCA.
[8] Kunal Korgaonkar et al. 2019. The Bitlet Model: Defining a Litmus Test for the Bitwise Processing-in-Memory Paradigm. arXiv preprint arXiv:1910.10234 (2019).
[9] Shahar Kvatinsky et al. 2011. Memristor-based IMPLY logic design procedure. In ICCD.
[10] S. Kvatinsky et al. 2014. MAGIC-Memristor-Aided Logic. IEEE TCAS-II: Express Briefs (2014). https://doi.org/10.1109/TCSII.2014.2357292
[11] S. Kvatinsky et al. 2014. Memristor-Based Material Implication (IMPLY) Logic: Design Principles and Methodologies. IEEE Trans. VLSI Syst. (2014).
[12] Mario Lanza et al. 2019. Recommended methods to study resistive switching devices. 5, 1 (2019), 1800143.
[13] W. Lu et al. 2011. Two-terminal resistive switches (memristors) for memory and logic applications. In ASP-DAC.
[14] A. Mishchenko et al. 2011. Delay optimization using SOP balancing. In ICCAD.
[15] M. O'Connor et al. 2017. Fine-Grained DRAM: Energy-Efficient DRAM for Extreme Bandwidth Systems. In MICRO.
[16] Ghasem Pasandi and Massoud Pedram. 2019. Balanced Factorization and Rewriting Algorithms for Synthesizing Single Flux Quantum Logic Circuits. In GLSVLSI. https://doi.org/10.1145/3299874.3317967
[17] Sudharsan Seshadri et al. 2014. Willow: A User-Programmable SSD. In USENIX Symposium on Operating Systems Design and Implementation (OSDI). 67–80.
[18] Vivek Seshadri et al. 2017. Ambit: In-memory accelerator for bulk bitwise operations using commodity DRAM technology. In MICRO.
[19] N. Talati et al. 2016. Logic Design Within Memristive Memories Using Memristor-Aided loGIC (MAGIC). IEEE Trans. Nanotechnol. (2016). https://doi.org/10.1109/TNANO.2016.2570248
[20] V. Tenace et al. 2019. SAID: A Supergate-Aided Logic Synthesis Flow for Memristive Crossbars. In DATE. https://doi.org/10.23919/DATE.2019.8714939
[21] Phrangboklang L. Thangkhiew and Kamalika Datta. 2018. Scalable in-memory mapping of Boolean functions in memristive crossbar array using simulated annealing. Journal of Systems Architecture (2018). https://doi.org/10.1016/j.sysarc.2018.07.002
[22] Wm A. Wulf and Sally A. McKee. 1995. Hitting the memory wall: implications of the obvious. ACM SIGARCH Computer Architecture News (1995).