CONTRA: Area-Constrained Technology Mapping Framework For Memristive Memory Processing Unit
Debjyoti Bhattacharjee, Anupam Chattopadhyay, Srijit Dutta, Ronny Ronen, Shahar Kvatinsky
Debjyoti Bhattacharjee
IMEC, Leuven, Belgium, [email protected]
Anupam Chattopadhyay
School of Computer Science and Engineering, Nanyang Technological University, Singapore
Srijit Dutta
Samsung Electronics, Samsung Digital City, South Korea
Ronny Ronen, Shahar Kvatinsky
Andrew and Erna Viterbi Faculty of Electrical Engineering, Technion-Israel Institute of Technology, Haifa, Israel
ABSTRACT
Data-intensive applications are poised to benefit directly from processing-in-memory platforms, such as memristive Memory Processing Units, which allow leveraging data locality and performing stateful logic operations. Developing design automation flows for such platforms is a challenging and highly relevant research problem. In this work, we investigate the problem of minimizing delay under an arbitrary area constraint for MAGIC-based in-memory computing platforms. We propose an end-to-end area-constrained technology mapping framework, CONTRA. CONTRA uses Look-Up Table (LUT) based mapping of the input function on the crossbar array to maximize parallel operations and uses a novel search technique to move data optimally inside the array. CONTRA accepts benchmarks in a variety of formats, along with the crossbar dimensions, as input to generate MAGIC instructions. CONTRA scales to large benchmarks, as demonstrated by our experiments. It maps benchmarks to smaller crossbar dimensions than achieved by any other technique before, while allowing a wide variety of area-delay trade-offs. CONTRA improves the composite metric of area-delay product by 2× to 13× compared to seven existing technology mapping approaches.

CCS CONCEPTS
• Hardware → Memory and dense storage; • Software and its engineering → Source code generation.

KEYWORDS
In-memory computing, RRAM, Technology mapping, Design automation flow, MAGIC operations
ACM Reference Format:
Debjyoti Bhattacharjee, Anupam Chattopadhyay, Srijit Dutta, Ronny Ronen, and Shahar Kvatinsky. 2020. CONTRA: Area-Constrained Technology Mapping Framework For Memristive Memory Processing Unit. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD ’20), November 2–5, 2020, Virtual Event, USA.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ICCAD ’20, November 2–5, 2020, Virtual Event, USA
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-8026-3/20/11...$15.00
https://doi.org/10.1145/3400302.3415681
ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3400302.3415681
The separation between the processing units and the memory unit requires data transfer over energy-hungry buses. This data transfer bottleneck is popularly known as the memory wall. The overhead in terms of energy and delay associated with this transfer of data is considerably higher than the cost of the computation itself [20]. Extensive research has been conducted to overcome the memory wall, ranging from the classic memory hierarchy to the close integration of processing units within the memory [1, 22]. However, these methods still require transfer of data between the processing blocks and the memory, thus falling into the category of von Neumann architectures.

Processing data within the memory has emerged as a promising alternative to the von Neumann architecture. This is generally referred to as Logic-in-Memory (LiM). The primary approach to perform LiM is to store input variables and/or logic outputs in a memory cell. This is enabled when the physical capabilities of the memory can be used for data storage (as memory) and computation (as logic). Various memory technologies, including Resistive RAM (RRAM), Phase Change Memory (PCM), Spin-Transfer Torque Magnetic RAM (STT-MRAM) and others, have been used to realize LiM computation [2, 8, 9, 12, 15, 16, 18].

RRAM is one of the contending technologies for logic-in-memory computation. RRAM permits stateful logic, where the logical states are represented as resistive states of the devices, which at the same time are capable of computation. Multiple functionally complete logic families have been successfully demonstrated using RRAM devices [21]. In the following, three prominent logic families are presented.
Material Implication Logic [18]: Consider two RRAM devices p and q with internal states S_p and S_q respectively, as shown in Fig. 1a. By applying voltages to the terminals, material implication can be computed, with the next state (NS) of device p set to the result of the computation:

NS_p = S_p → S_q    (1)

Majority Logic [9]: In this approach, shown in Fig. 1b, the wordline voltage (V_wl) and bitline voltage (V_bl) act as logic inputs, while the internal resistive state S_x of the device x acts as a third input. The next state of device x is a function of these three inputs:

NS_x = M(S_x, V_wl, V_bl)    (2)

Figure 1: Logic primitives realized using memristors. (a) Material Implication (b) Majority logic (c) Memristor Aided Logic (MAGIC).

Memristor-Aided loGIC (MAGIC) [16]: MAGIC allows in-memory compute operations by using the internal resistive states of single or multiple RRAM devices as inputs. The exact number of inputs (k) depends on the specific device used for computation. The result of the computation is written to a new device (r), as shown in Fig. 1c. The internal resistive states of the input devices remain unchanged. Using MAGIC operations, multi-input NOR and NOT can be realized:

NS_r = NOR(S_i1, S_i2, . . . , S_ik)    (3)
NS_r = NOT(S_i1)    (4)

General-purpose architectures have been proposed based on these primitives. A bit-serial Programmable Logic in Memory (PLiM) architecture was proposed by Gaillardon et al. [9] that uses majority as the logic primitive. PLiM relies on using the same crossbar for storage of instructions as well as for computation. The RRAM-based Very Long Instruction Word (VLIW) Architecture for in-Memory comPuting (ReVAMP) was proposed by Bhattacharjee et al. [5]; it uses an instruction memory for instruction storage and a separate RRAM crossbar as data storage and computation memory. Haj-Ali et al. proposed the memristive Memory Processing Unit (mMPU) [11]. The mMPU consists of memristive memory arrays, along with Complementary Metal Oxide Semiconductor (CMOS) periphery and control circuits, to support computations as well as conventional data read and write operations. To perform a computation within the mMPU, a compute command is sent to the mMPU controller. The controller generates the corresponding control signals and applies them to the crossbar array to perform the actual MAGIC operations. The mMPU allows MAGIC NOR and NOT gates to be executed within any part of the crossbar array, which allows storage of data as well as computation to happen in the same array.
Compared to the architectures based on Material Implication and Majority logic, MAGIC provides an inherent advantage: for MAGIC, the control signals do not depend on the output of a compute operation.

Wider acceptance of these architectures and technologies critically relies on efficient design automation flows, including logic synthesis and technology mapping. In this paper, we focus on the technology mapping challenge for architectures supporting MAGIC operations. Intuitively, a Boolean function (represented using a logic-level intermediate form) is processed by the technology mapping flow to generate a sequence of MAGIC operations, which are executed on the limited area available on a crossbar. The number of devices available for computation using MAGIC operations on the mMPU is limited [17, 29], which makes the problem of technology mapping even more challenging. This particular variant is known as the area-constrained technology mapping problem (ACTMaP) for mMPU. Multiple technology mapping solutions for mMPU have been proposed in the literature [3, 14, 26–28, 30]. Almost all of these works focus on delay reduction; only one [3] accepts a limited form of area constraints (limited row-size only) and considers device reuse to improve area efficiency.

In this paper, we propose CONTRA – the first scalable area-constrained technology mapping flow for LiM computing using MAGIC operations. CONTRA not only allows specifying an overall area constraint (in terms of the number of devices) but also the exact crossbar dimensions. This enables CONTRA to map the same function into, say, a 64 × 64 or a 128 × 128 crossbar with different delays, whereas the existing methods cannot offer this flexibility. Specifically, our paper makes the following contributions:
• We propose a scalable 2-dimensional area-constrained technology mapping flow for LiM computing using MAGIC operations.
• We present novel algorithms, using the NOR-of-NORs representation (NoN), to place the LUTs on the crossbar to maximize parallelism while maintaining the area constraints. We use an optimal A* search technique for moving inputs to the required positions in the crossbar and propose an input alignment optimization to reduce the number of copy operations.
• We extensively evaluate our technique using various benchmarks. The overall flow achieves an improvement in area-delay product of 2× to 13× in terms of geometric mean compared to seven existing technology mapping approaches for MAGIC. Our method can map arbitrary Boolean functions using MAGIC operations to smaller crossbar dimensions than achieved by any other technique before.

CONTRA takes an input benchmark and processes it using the novel technology mapping flow to generate MAGIC instructions. We developed an in-house simulator for MAGIC to execute the instructions and formally verify the functional equivalence of the generated instructions and the input benchmark.

We present the basics of computing using MAGIC operations to begin with. As shown in Fig. 2a, a 2-input MAGIC NOR gate consists of two input memristors (IN1 and IN2) and one output memristor (OUT). The memristive state of the output memristor changes in accordance with the resistive states of the input memristors. A low resistive state is interpreted as logical ‘1’, while a high resistive state is interpreted as logical ‘0’. The NOR gate operation is realized by applying V_G to the input memristors while the output memristor is grounded. Note that the output memristor has to be initialized to a low resistive state before the NOR operation is carried out. After applying the voltage, the resistance of the output memristor is set based on the ratio between the resistances of the input and the output memristors, resulting in a NOR operation. The MAGIC NOR operation can be performed with the devices arranged in a crossbar configuration, as shown on the right-hand side of Fig. 2a.

Source code available: https://github.com/debjyoti0891/arche
Figure 2: Basic MAGIC operations on a crossbar array. (a) MAGIC operations using memristors, which can be performed in a crossbar configuration. (b) Memristors arranged in a crossbar configuration. (c) Horizontal NOR. (d) Vertical NOR. (e) NOT.

By extending this approach, it is feasible to perform logical n-input NOR and NOT operations. Multiple MAGIC operations can be performed in parallel. The parallel execution of multiple NOR gates is achieved whenever the inputs and outputs of the n-input NOR gates are aligned in the same rows or columns of the crossbar, as shown in Fig. 2b. For example, in Fig. 2c, two 3-input NOR operations are performed in parallel:

M_1,4 = NOR(M_1,1, M_1,2, M_1,3)
M_2,4 = NOR(M_2,1, M_2,2, M_2,3)

Vertical operations are also allowed, as shown in Fig. 2d:

M_3,1 = NOR(M_1,1, M_2,1)

A single-input NOR operation is a NOT gate, as shown in Fig. 2e:

M_1,2 = NOT(M_1,1)

Thus, both n-input NOR and NOT gates can be executed by MAGIC operations. It is also possible to reset devices in the crossbar to ‘1’ in parallel, either row-wise or column-wise.

For logic synthesis and technology mapping approaches, a classification of different Intermediate Representations (IRs) has been proposed in [24]. First, there are Functional approaches, where the IR is used to explicitly express the logic function; examples of such IRs are Boolean truth tables, Look-Up Tables (LUTs) and Binary Decision Diagrams (BDDs). Second, there are Structural approaches, where the IR is used to represent the structure of the circuit, e.g., using And-Inverter Graphs (AIGs). For technology mapping on memristive crossbars, both types of approaches have been adopted, depending on which fits the device-level operations more closely. Among the design automation flows developed for memristive technologies, Majority-Inverter Graphs (MIGs) are predominantly used due to their native mapping onto devices supporting Majority Boolean functions [4, 23]. MAGIC devices realize multi-input NOR operations, which do not allow a direct mapping from MIGs. Hence, in this work, we use a LUT graph and the NOR-of-NORs representation for solving ACTMaP for
mMPU. The rationale for using a LUT graph is that it allows mapping of all forms of Boolean functions [27].

Figure 3: cm151a benchmark partitioned into LUTs with k = 3. Each triangular node represents a primary input, while the inverted triangles represent primary outputs. Each round node represents a LUT. The LUT id and its functionality in SoP are shown inside each node.

LUT graph:
Any arbitrary Boolean function can be represented as a directed acyclic graph (DAG) G = ⟨V, E⟩, with each vertex having at most k predecessors [25]. Each vertex v ∈ V with k predecessors represents a k-input Boolean function, or simply a k-input LUT. Each edge u → v represents a data dependency from the output of node u to an input of node v. Example 2.1.
Fig. 3 shows the cm151a benchmark from LGSynth91 as a DAG with k = 3. The benchmark has 12 primary inputs a–l and two primary outputs m and n. LUT 16 has a dependency on LUTs 17 and 18 and on primary input j. We use this benchmark as a running example to explain the proposed method.

NOR-of-NOR representation: A Boolean function F : B^n → B, expressed in sum-of-products (SoP) form, can be converted to the NOR-of-NORs (NoN) representation by the following simple transformations:
(1) Replace the ∨ and ∧ operations with NOR.
(2) Flip the polarity of each primary input.
(3) Negate the result.
For example, we can express F in the NoN representation as follows:

F = (a ∧ b) ∨ (a ∧ b ∧ c) = NOT(NOR(NOR(¬a, ¬b), NOR(¬a, ¬b, ¬c)))    (5)

Alternatively, we can express this NoN as a matrix over the variables a, b and c, with one row per product term holding the flipped literals of that term.

Multiple works address the issue of design automation for computation with a bound on the number of memristive devices. Lehtonen et al. presented a methodology for computing arbitrary Boolean functions using devices that realize material implication [18]; for any n-input, m-output Boolean function, three working memristors are sufficient for computation. This bound was further reduced to two working memristors by Poikonen et al. [19]. Optimal and heuristic solutions for ACTMaP for devices realizing majority with a single inverted input have been proposed in [6]. A crossbar-constrained ACTMaP solution has been proposed for devices realizing majority with a single inverted input in [7].
[Flow diagram: the input benchmark and the parameters k, R×C and spacing are processed by ABC to generate a LUT graph; LUT placement, LUT input placement (via A* search) and input alignment of stacked LUTs produce MAGIC instructions, which are verified with an mMPU simulator.]
Figure 4: CONTRA: area-Constrained Technology mapping fRAmework for Memristive Memory Processing Unit
As mentioned before, several technology mapping methods for mMPU have been proposed in the literature [3, 14, 26–28, 30]. These methods primarily work towards reducing the latency of mapping an arbitrary function and output the dimensions of the crossbar required to map the function. While trying to maximize parallelism, these methods often map to highly skewed crossbar dimensions (where the number of rows is much higher than the number of columns, or vice versa). Furthermore, these methods are highly area inefficient, since they do not reuse devices, leading to very low device utilization. To our knowledge, SIMPLER [3] is currently the only method for mMPU that is optimized for area. SIMPLER relies on mapping functions to a single row, with the objective of achieving high throughput by simultaneously executing multiple data streams in different rows. As SIMPLER allows device reuse, it has high area utilization. However, the utility of this method is limited, as all the used devices must still be allocated in a single memory row; it cannot use a 2-dimensional crossbar for mapping in order to fit a function into a small crossbar. We address the challenge of 2-dimensional constrained mapping.
In this section, we describe CONTRA, a 2-dimensional area-Constrained Technology mapping fRAmework for the memristive memory processing unit, which is shown in Fig. 4.
The goal of this phase is to map the individual nodes (LUTs) of the input DAG on the crossbar, so as to minimize the delay of computing. LUTs in the same topological level of the DAG do not have any dependencies between themselves and therefore can be scheduled in parallel. In order to permit computation of multiple LUTs in parallel, we utilize the NOR-of-NOR representation of the LUT function.

Since the NoN representation consists of only NOR and NOT operations, it can be computed by MAGIC operations directly in 3 cycles, ignoring the initialization cycle(s). All the variables of a product term, in the appropriate polarity (inverted or regular), are aligned in a row. For the variables that are not present in a product term, the corresponding memristor is set to ‘1’, which is the state of the memristor after reset. This is followed by computing the NOR of all the product terms horizontally in a single cycle. In the next cycle, a vertical NOR of the above results produces the negated output. In the last cycle, we negate this result to get the output of the computed function.
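The SoP-to-NoN conversion and the three-cycle evaluation just described can be sketched in Python. This is an illustrative model, not the CONTRA source: literals are strings with a '~' prefix for negation (our own encoding), `magic_nor` models a MAGIC NOR writing NOT(OR(inputs)), and the horizontal NOR of each row is taken over the literal cells of that product term.

```python
def magic_nor(*inputs):
    # MAGIC NOR: the output device, preset to '1', flips to '0'
    # when any input device holds '1'
    return int(not any(inputs))

def sop_to_non(terms):
    """Flip the polarity of every literal; the result is read as
    NOT(NOR(NOR(term_1), NOR(term_2), ...))."""
    flip = lambda lit: lit[1:] if lit.startswith("~") else "~" + lit
    return [[flip(lit) for lit in term] for term in terms]

def eval_non(non, assignment):
    """3-cycle evaluation: parallel horizontal NORs (one per product term),
    a vertical NOR of the results, then a final NOT."""
    val = lambda lit: int(not assignment[lit[1:]]) if lit.startswith("~") \
        else int(assignment[lit])
    h = [magic_nor(*(val(lit) for lit in term)) for term in non]  # cycle 1
    f_bar = magic_nor(*h)                                         # cycle 2
    return magic_nor(f_bar)                                       # cycle 3

# F = (a AND b) OR (a AND b AND c), as in equation (5)
non = sop_to_non([["a", "b"], ["a", "b", "c"]])
print(non)                                      # [['~a', '~b'], ['~a', '~b', '~c']]
print(eval_non(non, {"a": 1, "b": 1, "c": 0}))  # 1
print(eval_non(non, {"a": 0, "b": 1, "c": 1}))  # 0
```

The three cycles in `eval_non` mirror the description above: one parallel horizontal NOR per product-term row, one vertical NOR producing the complement of F, and one NOT restoring F.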
Example 3.1.
The computation of F in equation (5) using MAGIC operations is shown in Fig. 5. Rows 1 and 2 hold the inputs for the first and second product terms, respectively. These inputs are NORed in parallel to compute the product terms, with the outputs H1 and H2 written to the output column. The product terms are then vertically NORed to compute the complement of F. In the final step, this complement is inverted using a NOT operation to compute F.

Figure 5: Computation of F with 3 inputs and 2 product terms using MAGIC operations on a crossbar: input placement, parallel horizontal NOR, computing using vertical NOR, computing using NOT.

The LUTs are topologically ordered and grouped by the number of inputs. The LUTs are placed one below another, with inputs aligned, until we are limited by the height of the crossbar. Consider n LUTs, each with k inputs. Once the LUTs are aligned one below another, we can compute the horizontal NOR of all LUTs in one cycle, since the inputs and outputs of all the LUTs are aligned and the column voltages apply to all LUTs. In the next n cycles, we perform the vertical NOR operations to compute the inverted outputs of the n stacked LUTs. Thus, (n + 1) cycles are required to compute the n stacked LUTs. Let each k-input LUT L_i have p_i product terms, 1 ≤ i ≤ n. The area L^n_area required to compute the n LUTs in parallel is:

L^n_area = Σ_{i=1}^{n} (p_i + 1) × (k + 1)    (6)

The LUT placement strategy proceeds from top to bottom and from left to right. The spacing parameter specifies the number of rows that are left empty between two vertically stacked LUTs. If there are not enough free devices to place a new LUT, the crossbar is scanned row-wise and column-wise to check which rows or columns hold intermediate results. These are considered blocked, and the rest of the crossbar is reset either row-wise or column-wise, whichever leaves fewer devices blocked. The process is repeated until all the LUTs are placed. The overall flow is presented in Algorithm 1.

Example 3.2. For cm151a, we first stack LUTs 17 and 18 in the crossbar, as shown in Fig. 6. Since enough space is not available vertically, we stack LUTs 20 and 21 on the right. We then reset the crossbar, without resetting columns 4 and 8, as these columns contain the intermediate results. We continue placing the other LUTs in a similar manner.

Figure 6: LUT placement phase on a crossbar for the cm151a benchmark. 1: [LUT 17 … → (3,4)] [LUT 18 (4,1) → (6,4)]; 2: [LUT 20 (1,5) → (3,8)] [LUT 21 (4,5) → (6,8)]; 3: [Reset columns except {4,8}]; 4: . . .

Note that we are effectively computing the inverted output of each LUT. Therefore, for the output LUTs, an additional NOT operation is required, as specified in lines 17-19 of Algorithm 1.
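The cycle and device counts for a stack of LUTs follow directly from the description above. A minimal sketch with our own helper name, assuming the per-LUT area of equation (6) reads as (p_i + 1) rows by (k + 1) columns:

```python
def stack_cost(product_counts, k):
    """Delay and area of computing n stacked k-input LUTs in parallel:
    one horizontal-NOR cycle plus n vertical-NOR cycles, and
    (p_i + 1) x (k + 1) devices per LUT (cf. equation (6))."""
    n = len(product_counts)
    cycles = n + 1
    area = sum((p + 1) * (k + 1) for p in product_counts)
    return cycles, area

# two stacked 3-input LUTs with 2 product terms each
print(stack_cost([2, 2], k=3))  # (3, 24)
```

The first example matches the function F of equation (5): a single 3-input, 2-product-term LUT occupies (2 + 1) × (3 + 1) = 12 devices, and two such LUTs stacked together compute in 3 cycles.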
Algorithm 1:
Area-constrained technology mapping.
Input: G, R, C, spacing
Output: Mapping of G to crossbar R × C.
1  do
2      L_set = Pick LUTs in a topological level with an equal number of inputs
3      if limited by space vertically then
4          Start placing from the next available column
5      end
6      if limited by both vertical and horizontal space then
7          Reset the cells, keeping the intermediate outputs intact
8      end
9      Place L_set stacked together vertically, with spacing rows empty between subsequent LUTs
10     Schedule all the LUTs in L_set in the same time slot of the schedule
11 while there is a LUT not yet placed
12 for each set of LUTs stacked together do
13     Place the inputs for these LUTs, using A* search and vertical copies
14     Compute intermediate results in parallel using horizontal NORs
15     Compute the inverted outputs of the LUTs in sequence using vertical NORs
16 end
17 for each inverted output of G do
18     Invert using a NOT operation to compute the outputs of G
19 end

For some of the LUTs, we require the intermediate outputs from previous computations as inputs. We use A* search to obtain the shortest path to copy an intermediate value from a source (R_S, C_S) to a destination (R_D, C_D) with a minimum number of NOT operations. The cost of a location is cost(r, c) = f(r, c) + g(r, c), where f(r, c) is the number of copy operations used to reach (r, c) from (R_S, C_S), and

g(r, c) = 0 if (r, c) is the destination; 1 if r = R_D or c = C_D; 2 otherwise.    (7)

All empty cells in the row and column of the current location are considered its neighbours. The search starts at the source, updates the costs of the neighbouring locations, and picks the location with the least cost. The process is repeated until the goal state is reached. If the path length is odd, the polarity of the input is inverted, while for an even-length path the polarity is preserved, due to the odd or even number of NOT operations, respectively. If the inputs of a NoN have only positive or only negative terms, but not both, we need to choose the copy path to be even or odd accordingly.
If the inputs are of mixed polarity, we can choose the path with the shorter length; the polarity does not matter. Thereafter, the input variable is vertically copied to the different rows required by the other product terms of the LUT, according to the NoN representation.
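The copy-path search can be sketched with Python's heapq. This is an illustrative reimplementation (our own function and parameter names, not the CONTRA code): g(r, c) follows equation (7), the neighbours are the cells that share a row or column with the current cell, and each hop corresponds to one NOT-copy.

```python
import heapq

def copy_path(free, src, dst):
    """A* search: fewest NOT-copies from src to dst on a grid.
    free[r][c] is True when cell (r, c) may hold an intermediate copy."""
    R, C = len(free), len(free[0])

    def g(r, c):  # estimate of the remaining hops, per equation (7)
        if (r, c) == dst:
            return 0
        return 1 if r == dst[0] or c == dst[1] else 2

    frontier = [(g(*src), 0, src, [src])]
    best = {src: 0}
    while frontier:
        _, hops, (r, c), path = heapq.heappop(frontier)
        if (r, c) == dst:
            return path  # len(path) - 1 == number of NOT operations
        # neighbours: cells in the same row or column that are free
        cells = [(r, cc) for cc in range(C)] + [(rr, c) for rr in range(R)]
        for nxt in cells:
            if nxt != (r, c) and (free[nxt[0]][nxt[1]] or nxt == dst):
                if hops + 1 < best.get(nxt, float("inf")):
                    best[nxt] = hops + 1
                    heapq.heappush(
                        frontier,
                        (hops + 1 + g(*nxt), hops + 1, nxt, path + [nxt]))
    return None  # destination unreachable

free = [[True] * 4 for _ in range(4)]
free[0][3] = False                  # a blocked cell holding live data
path = copy_path(free, (0, 0), (2, 3))
print(len(path) - 1)                # number of NOT copies -> 2
```

With the direct cell in row 0 blocked, the search detours through the destination's row, yielding an even-length (2-hop) path, so the copied value keeps its polarity.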
LUT 16 uses the output of LUT 17 as an input, with the NoN representation shown in Fig. 7. We copy the value of LUT 17 to the required cell using a sequence of NOT operations, obtained using A* search.
[Fig. 7 sub-figures: the NoN representations of LUTs 16 and 19; copying LUT 17 as an input to LUT 16; the crossbar state after input placement.]
Figure 7: Placement of the inputs for LUTs 16 and 19 and the corresponding literals for the NOR-of-NOR computation.
The copy is performed as a sequence of NOT operations along the path found by the A* search. The state of the crossbar after placing all the inputs (LUTs 17, 18, 20 and 21, and primary inputs i and j) for LUTs 16 and 19 is shown in the last sub-figure of Fig. 7.

Multiple LUTs scheduled together for execution often share common inputs. If the common inputs are assigned to the same column, then only a single A* search is required to bring the input to the column, followed by vertical copies to the appropriate rows. This leads to a reduction in delay, as well as a reduction in the number of devices involved in copying. The goal is to find an assignment of the inputs of the individual LUTs to columns that maximizes the number of aligned inputs in a set of stacked LUTs.

We encode the constraints of this problem to solve it optimally using a Satisfiability Modulo Theories (SMT) solver:
▷ Maximize Σ_{c=1}^{k} Σ_{li=1}^{n} Σ_{lj=1}^{n} align_{c,li,lj}
▷ A_{c,l} = v | ∃ v ∈ inputs of LUT l, for 1 ≤ c ≤ k and 1 ≤ l ≤ n.
▷ align_{c,li,lj} = (A_{c,li} = A_{c,lj}), for 1 ≤ c ≤ k, 1 ≤ li ≤ n and 1 ≤ lj ≤ n.
The assignment A_{c,l} = v indicates that variable v is assigned to column c of LUT l. For n LUTs, each with k inputs, a brute-force approach would have a time complexity of (k!)^(n−1). As the SMT solver takes a long time to solve and has to be executed multiple times when mapping a benchmark, we propose a greedy algorithm for faster mapping.

Consider k-input LUTs, with n of these LUTs stacked together. This can be represented as a matrix of dimensions n × k, where each row lists the input variables of one LUT. As the inputs of a LUT are unique, each variable occurs at most once in each row of the matrix. The detailed alignment approach is shown in Algorithm 2. We explain the algorithm with a representative example. Example 3.4.
Consider three 4-input LUTs, with their input variables arranged as an unaligned matrix, as shown below. The variables are ordered in descending order of frequency: L = {a:3, b:2, c:2, d:1, e:1, h:1, g:1, x:1}. We start the alignment by placing ‘a’ in the first column. In the next step, we place ‘b’. As rows 1 and 2 of column 1 are already occupied by ‘a’, we place ‘b’ in column 2. Similarly, we continue the process until all the variables are placed.

Unaligned    Step 1       Step 2       ...  Aligned
a b c d      a φ φ φ      a b φ φ           a b c d
b c e a      a φ φ φ      a b φ φ           a b c e
h a g x      a φ φ φ      a φ φ φ           a h g x

Example 3.5.
For LUTs 16 and 19, the result of the alignment is shown in the first sub-figure of Fig. 7, with the aligned variables marked in pink. The variables 17, 18 and j of LUT 16 and the variables 20, 21 and j of LUT 19 are assigned to columns such that the shared input variable j is aligned.
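The greedy alignment (detailed as Algorithm 2 below) admits a compact Python sketch; this is an illustrative reimplementation with our own names, not the CONTRA source:

```python
from collections import Counter

PHI = None  # marker for an empty cell

def align_inputs(M):
    """Greedy input alignment: M is an n x k matrix whose rows list the
    input variables of each stacked LUT; returns the aligned matrix."""
    n, k = len(M), len(M[0])
    # variables in descending order of occurrence count
    counts = Counter(v for row in M for v in row)
    order = sorted(counts, key=lambda v: -counts[v])
    M_align = [[PHI] * k for _ in range(n)]
    for v in order:
        rows = [r for r in range(n) if v in M[r]]
        # first column that is free in every row containing v
        target = next((c for c in range(k)
                       if all(M_align[r][c] is PHI for r in rows)), None)
        for r in rows:
            if target is not None:
                M_align[r][target] = v
            else:  # no common free column: fall back to any free column
                M_align[r][M_align[r].index(PHI)] = v
    return M_align

print(align_inputs([["a", "b", "c", "d"],
                    ["b", "c", "e", "a"],
                    ["h", "a", "g", "x"]]))
# [['a', 'b', 'c', 'd'], ['a', 'b', 'c', 'e'], ['a', 'h', 'g', 'x']]
```

On the matrix of Example 3.4, the sketch reproduces the aligned matrix shown there: ‘a’ claims column 1 in all three rows, then ‘b’ and ‘c’ align across the first two rows, and the remaining singleton variables fill the free cells.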
Algorithm 2:
Input Alignment
Input: M    Output: M_align
L = list of the variables in the matrix, in descending order of occurrence count;
M_align = n × k matrix initialized with φ;
for variable v in L do
    R_v = {r | v ∈ row r of M};
    target_c = None;
    for column c in matrix M do
        if M_align[r][c] == φ for all r ∈ R_v then
            target_c = c; break;
        end
    end
    if target_c == None then
        Place v in any free column of each row r ∈ R_v;
    else
        Place v in column target_c of each row r ∈ R_v;
    end
end
return M_align;

This completes the description of the technique for area-constrained mapping. The output of mapping the cm151a benchmark to an 8 × 8 crossbar (k = 3, spacing = 0) is shown in Fig. 8:

T1 INPUT (0, 0) ~pi
T2 INPUT (3, 0) ~pi
...
T11 HNOR (0, 0) (0, 1) (0, 2) (0, 3) ...
T12 VNOR (0, 3) (1, 3) (2, 3) old_n18_
T13 VNOR (3, 3) (4, 3) (5, 3) old_n19_
T14 reset 0 1 2 4 5 6 7
T15 INPUT (0, 0) ~pi
T16 INPUT (3, 0) ~pi
...
T33 COPY (5, 3) (5, 0) old_n19_
T34 COPY (5, 0) (5, 1) ~old_n19_
...
T69 VNOR (0, 4) (1, 4) (2, 4) pm
T70 VNOR (3, 4) (4, 4) (5, 4) (6, 4) pn
T71 NOT (2, 4) (2, 0) pm | NOT (6, 4) (6, 0) pn
Figure 8: Snippet of MAGIC instructions generated by CONTRA onmapping cm151a benchmark on × crossbar with k = and spac-ing=0. This section presents the experimental results of the CONTRA, theproposed area-Constrained Technology mapping fRAmework forfor computing arbitrary functions using MAGIC operations. Wehave implemented the proposed CONTRA framwork using Python.CONTRA supports a variety of input formats for the benchmarks,including blif , structural verilog , aig . We have used ABC [25] forall generating the LUT graph and the SOP representation of LUTfunctions, which we converted to NoN representation for mapping.For each benchmark, CONTRA generates cycle accurate MAGICinstructions. A representative output of mapping is shown in Fig. 8.We developed an in-house mMPU simulator for executing MAGICinstructions. We used the simulator to generate execution traceswhich were converted into Verilog. The generated Verilog and theinput benchmarks were formally checked for functional equivalenceusing the cec command of ABC.We benchmark our tool using the ISCAS85 benchmarks [13],which have been used extensively for evaluation of automation Table 1: Benchmarking results for the ISCAS85 benchmark for threecrossbar sizes. We ran each benchmark with k = { , , } and spacingset to { , , , } . For each benchmark, the best results were obtainedfor k = and spacing set to . (R,C) (64,64) (128,64) (128,128)Bench PI PO Cycles Cycles Cycles c432 36 7 797 774 770c499 41 32 1391 1341 1343c880 60 26 1314 1268 1263c1355 41 32 1390 1341 1344c1908 33 25 1511 1470 1469c2670 233 140 2132 2066 2060c3540 50 22 3751 3575 3575c5315 178 123 5022 4827 4831c6288 32 32 8176 7890 7881c7552 207 108 7308 7039 7036 Table 2: Benchmarking results for the EPFL MIG benchmarks forthree crossbar sizes. We ran each benchmark with k = { , , } andspacing set to . For each benchmark, the best results were obtainedfor k = . 
(R,C):               (64,64)  (128,64)  (128,128)
Bench      PI    PO  Cycles   Cycles    Cycles
arbiter    256  129  81941    81582     81434
cavlc      10   11   3808     3672      3686
ctrl       7    26   786      759       757
dec        8    256  1399     1253      1284
i2c        147  142  6698     6656      6692
int2float  11   7    1369     1340      1323
priority   128  8    5479     5398      5389
router     60   30   1150     1121      1153
voter      1001 1    -        68777     68758

Table 3: Benchmarking results for the EPFL arithmetic benchmarks for 256 × 256 crossbar size.

Bench       PI   PO   LUTs   k  Spacing  Cycles
adder       256  129  339    4  6        4398
bar         135  128  1408   4  6        12216
div         128  128  57239  2  6        342330
hyp         256  128  64228  4  -        -
log2        32   32   10127  4  1        128647
max         512  130  1057   4  6        9468
multiplier  128  128  10183  3  0        90925
sin         24   25   1915   4  6        21761
sqrt        128  64   8399   4  6        101694
square      64   128  6292   4  0        74614

flows for MAGIC. The experiments were run on a shared cluster with 16 Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz cores, running Red Hat Enterprise Linux 7. Table 1 shows the results of mapping the benchmarks for three crossbar dimensions. We report the execution time in seconds for 128 × 128 for the ISCAS85 benchmarks. We report the results for the best delay (in cycles), obtained by varying k from 2 to 4. As expected, the increase in crossbar dimensions results in lower delay of execution. We also report the results of mapping for the EPFL benchmarks (https://github.com/lsils/benchmarks). We report the results for the EPFL MIG benchmarks in Table 2 for three crossbar dimensions. For the larger EPFL arithmetic and random control benchmarks, we report the results for a crossbar with 256 × 256 dimensions in Table 3 and Table 4, respectively.

We observe that for most of the results, the best delay was obtained for k = 4. This is because setting a higher value of k leads to fewer LUTs in the LUT graph. Since multiple LUTs can be scheduled in parallel (based on constraints mentioned in Algorithm 1), this reduces the number of cycles needed to compute the benchmark by exploiting a higher degree of parallelism. For a large benchmark such as voter in Table 2 and a very small crossbar dimension (64 × 64), the mapping is infeasible: computation produces intermediate results which do not leave enough free devices to map the rest of the LUTs.

CONTRA: Area-Constrained Technology Mapping Framework For Memristive Memory Processing Unit, ICCAD ’20, November 2–5, 2020, Virtual Event, USA

Table 4: Benchmarking results for the EPFL control benchmarks for 256 × 256 crossbar size, with spacing set to 6. We ran each benchmark with k = {2, 3, 4} and the best results were obtained for k = 4.

Benchmark       PI    PO    LUTs   Cycles
ac97_ctrl       2255  2250  3926   27742
comp            279   193   8090   74379
des_area        368   72    1797   17273
div16           32    32    2293   22047
hamming         200   7     725    9414
i2c             147   142   423    3133
MAC32           96    65    3310   40007
max             512   130   1866   16072
mem_ctrl        1198  1225  3031   22021
MUL32           64    64    2758   31344
pci_bridge32    3519  3528  23257  110318
pci_spoci_ctrl  85    76    446    3621
revx            20    25    3056   31603
sasc            133   132   204    1476
simple_spi      148   147   305    2307
spi             274   276   1581   13115
sqrt32          32    16    989    11326
square          64    127   6083   67602
ss_pcm          106   98    159    968
systemcaes      930   819   3207   26981
systemcdes      314   258   1128   9468
tv80            373   404   3044   25986
usb_funct       1860  1846  5265   41029
usb_phy         113   111   187    1156
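As described earlier, each mapped benchmark is verified by converting the mMPU simulator's execution trace to Verilog and checking it against the original netlist with ABC's cec command. A minimal sketch of how such a check could be driven from Python follows; the file names and the exact success string matched in ABC's output are illustrative assumptions, not part of CONTRA's published flow.

```python
import subprocess

def abc_cec_command(golden: str, mapped: str) -> list:
    # Build the ABC invocation: `cec` performs combinational
    # equivalence checking of two networks.
    return ["abc", "-c", f"cec {golden} {mapped}"]

def formally_equivalent(golden: str, mapped: str) -> bool:
    # Run ABC and scan its report; matching on "Networks are equivalent"
    # is an assumption about ABC's output format.
    out = subprocess.run(abc_cec_command(golden, mapped),
                         capture_output=True, text=True)
    return "Networks are equivalent" in out.stdout
```
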
Spacing is the number of rows left free between two LUTs stacked vertically, as described in Algorithm 1. We analyze the impact of spacing on three large ISCAS85 benchmarks with k = 4, on the 64 × 64 and 128 × 128 crossbars; the results are shown in Fig. 10.

Fig. 11 shows the impact of crossbar dimensions on delay of mapping, while keeping the number of devices (R × C) constant. We considered k = {2, 3, 4}, spacing = {0, 2, 4, 6, 8} and three large ISCAS85 benchmarks. The best delay for all the benchmarks was obtained for k = 4, across dimensions ranging from 64 to 2048.

Fig. 12 shows the overhead of copy operations as a percentage. As evident from Fig. 12, copy operations constitute a large overhead in the computation of a benchmark. Since we use the A* search algorithm to align the inputs, the number of copy operations used for each alignment is optimal. However, in order to limit run time, we do not try to schedule multiple copy operations in parallel by considering multiple source and destination locations simultaneously. This could be investigated in the future, at the cost of higher execution time of the search algorithm.
The existing technology mapping approaches for MAGIC do not consider area constraints in mapping and focus only on minimizing the delay. Given a benchmark, the existing methods report the crossbar dimensions required to map the benchmark, along with the delay of mapping. These works therefore cannot map benchmarks to arbitrarily sized crossbar arrays. For comparison, we determine the smallest crossbar dimension for which the mapping was feasible using CONTRA. In the absence of area constraints, our method can achieve delay identical to SAID (E7) [27], since both CONTRA and SAID use LUT based mapping. CONTRA requires significantly lower area to map in comparison to the existing methods, while having relatively higher delay. As none of the methods support area constraints, we use the Area-Delay Product (ADP) as a composite metric for direct comparison.
ADP = R × C × Cycles    (8)

Improvement in ADP = ADP_Ei / ADP_CONTRA    (9)

The list of existing works we compare CONTRA to follows:
• E1 [10]
• E2 [31]
• E3, E4 [28]
• E5, E6 [30]
• E7 [27]
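Eqs. (8) and (9) are straightforward to evaluate. As a worked example, the c432 row of Table 5 (CONTRA: 20x12 crossbar, 824 cycles; E1: 146x9 crossbar, 349 cycles) gives a per-benchmark improvement of roughly 2.3x in CONTRA's favour:

```python
def adp(rows: int, cols: int, cycles: int) -> int:
    # Area-Delay Product (Eq. 8): crossbar area (R x C devices) x delay.
    return rows * cols * cycles

def adp_improvement(adp_existing: float, adp_contra: float) -> float:
    # Improvement of CONTRA over an existing method Ei (Eq. 9).
    return adp_existing / adp_contra

adp_contra = adp(20, 12, 824)   # c432 mapped by CONTRA -> 197760
adp_e1 = adp(146, 9, 349)       # c432 mapped by E1 [10] -> 458586
ratio = adp_improvement(adp_e1, adp_contra)   # ~2.32
```
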
Figure 10: Impact of the spacing parameter on delay for three benchmarks, considering two crossbar dimensions, 64 × 64 and 128 × 128, with k = 4.

Spacing:  0     2     4     6     8
64 × 64 crossbar:
c3540     4006  3760  3702  3761  3813
c5315     5354  4952  4963  5032  5108
c7552     8009  7348  7187  7327  -
128 × 128 crossbar:
c3540     3814  3664  3639  3585  3614
c5315     5071  4836  4795  4828  4804
c7552     7807  7141  7052  7038  7035

Figure 11: Impact of crossbar dimensions on delay of mapping, while keeping the number of devices constant.

Table 5: Comparison of CONTRA with existing works. Note that the delay for the existing works does not consider placement overhead of primary inputs. R = Number of Rows, C = Number of Columns, k = Number of inputs to generate LUT Graph.
Proposed E1 [10] E2 [31] E3 [28] E4 [28] E5 [30] E6 [30] E7 [27]
Bench k RxC Cycles | RxC Cycles | RxC Cycles | RxC Cycles | RxC Cycles | RxC Cycles | RxC Cycles | RxC Cycles
c432 3 20x12 824 | 146x9 349 | 22x42 225 | 62x11 265 | 51x47 342 | 36x150 338 | 69x13 290 | 36x84 156
c499 3 20x16 1140 | 323x13 1155 | 96x44 242 | 73x37 935 | 83x55 1059 | 45x182 903 | 116x31 707 | 144x28 420
c880 3 32x22 1389 | 383x5 761 | 67x39 427 | 124x30 750 | 103x73 913 | 69x73 726 | 107x14 613 | 100x53 482
c1355 3 36x16 1092 | 359x10 1072 | 96x63 236 | 72x43 938 | 91x55 1060 | 49x163 825 | 103x28 757 | 128x37 554
c1908 3 32x22 1489 | 312x13 1056 | 83x85 517 | 60x60 970 | 70x66 1075 | 42x88 928 | 93x33 648 | 69x54 627
c2670 4 38x34 2267 | 664x9 1490 | 66x92 551 | 301x45 1401 | 385x245 1495 | 202x137 1278 | 340x29 1183 | 355x33 643
c3540 4 60x26 3726 | 650x16 2396 | 137x164 1435 | 153x150 2418 | 160x161 2589 | 71x221 2007 | 109x55 1761 | 234x77 1566
c5315 4 64x48 5365 | 1261x11 3295 | 221x136 1361 | 298x73 3239 | 449x179 3382 | 249x122 2676 | 547x22 2251 | 441x42 1754
c6288 4 32x30 8744 | 2297x6 3776 | 151x870 3751 | 436x98 5007 | 265x265 5515 | 33x892 3161 | 49x115 3104 | 510x226 4069
c7552 4 64x48 8009 | 845x14 3929 | 214x175 2182 | 321x320 3824 | 381x379 4012 | 220x57 3031 | 542x22 2486 | 416x79 2565
GeoMean Reduction (Area): × × × × × × × GeoMean Overhead (Delay): × × × × × × ×
Figure 12: Overhead of primary input placement and copying intermediate results for LUT input, as a percentage of total cycles, for the ISCAS85 benchmarks.

Figure 13: Comparison of the ADP of CONTRA with existing works on the ISCAS85 benchmarks (log scale), along with the Geometric Mean (GeoMean) of improvement in ADP of CONTRA over existing works. GeoMean improvement in ADP: E1: 3.7, E2: 3.1, E3: 5.6, E4: 13.1, E5: 6.5, E6: 2.1, E7: 4.1.

Note that the A* search technique can be used for optimally moving the intermediate results to any desired location. We present the comparison results in Table 5. The main observations are: (1) CONTRA requires less crossbar area compared to all other methods. (2) Not only is the total area smaller, but each dimension is smaller as well, which makes mapping of logic into memory significantly more feasible. (3) Unfortunately, these benefits come with a slightly higher delay. None of the previous works on technology mapping for MAGIC consider the overhead of placing the primary inputs on the crossbar [10, 27, 28, 30, 31]. However, we considered the cost of placing the primary inputs in all our mapping results. From Fig. 12, we can observe that the input overhead in terms of number of cycles can be as high as 49% for smaller benchmarks. This strongly suggests that the overhead of input placement must be considered during mapping. Therefore, comparing our proposed method directly in terms of delay with existing works is unfair.

In Fig. 13, we plot the improvement in ADP for individual test cases from the ISCAS85 benchmarks. Barring two cases (c432 for E2 and c880 for E6), there is a considerable improvement in ADP for the proposed algorithm for all the benchmarks against all the existing implementations. We present the geometric mean of improvement in ADP of CONTRA over the existing methods. CONTRA achieves the best geometric mean improvement of 13.1x over E4. From Fig. 13, we can also rank the existing methods on the basis of their ADP. After CONTRA, E6 has the next best ADP, followed closely by E1 and E2, then E7, whereas E3, E4 and E5 are significantly worse.
Unlike MAGIC operations, where all the inputs are represented as states of memristors, Majority operations also use the bitline and wordline inputs as inputs, alongside the internal resistive state Z of the ReRAM, which acts as the third input and the stored bit. Using majority operations, the ReVAMP architecture was proposed by Bhattacharjee et al. [5]. ReVAMP supports two types of instructions. Apply instructions compute on the cells of a wordline. Read instructions read the internal state of a word onto a data-memory register using sense amplifiers, which can then be used as input to subsequent Apply instructions. In the case of MAGIC, read operations are not used during in-memory operations.

For the sake of completeness, we compare CONTRA against a recently proposed area-constrained mapping approach, ArC, for ReVAMP [7]. The results of the comparison are shown in Table 6. CONTRA achieves better delay compared to ArC, while requiring a larger number of memristors to map the benchmarks. It should
Table 6: Comparison of CONTRA against ArC for ReVAMP [7].
Bench  RxC    Overhead  Cycles  Speedup  ADP_CONTRA/ADP_ArC
c432   8x14   2.1       1654    2.0      1.1
c499   8x14   2.9       2450    2.1      1.3
c880   8x14   6.3       2569    1.8      3.4
c1355  8x14   5.1       2460    2.3      2.3
c1908  12x14  4.2       2774    1.9      2.2
c2670  16x16  5.0       4307    1.9      2.7
c3540  18x24  3.6       7152    1.9      1.9
c5315  26x24  4.9       8005    1.5      3.3
c6288  16x24  2.5       14871   1.7      1.5
c7552  20x24  6.4       11079   1.4      4.6

be noted that the delay for ArC is equal to the number of cycles required for computes and reads. Also, ReVAMP uses an external interconnect network for alignment of inputs, which does not contribute to the number of cycles but in practice would imply higher controller energy. In the case of MAGIC, alignment operations are done inside the crossbar itself, which leads to higher delay and more memristors being used for the COPY operations.
In this work, we presented the first area-constrained technology mapping flow for LiM using MAGIC operations on a crossbar array. We provide a scalable approach to solve the problem that tries to maximize parallelism. We introduce an optimal search algorithm for alignment of variables between two locations in a crossbar. We unlock the possibility of mapping Boolean functions to a wide variety of crossbar dimensions using MAGIC operations. The proposed algorithm outperforms state-of-the-art technology mapping approaches for MAGIC in terms of ADP. As evident from our comparative studies, existing design automation flows for in-memory computing platforms are far from capturing the nuances of practical constraints. To alleviate this problem, we will apply our flow on actual design prototypes and come up with more rigorous benchmarks with detailed characterization.
REFERENCES
[1] Shaizeen Aga, Supreet Jeloka, Arun Subramaniyan, Satish Narayanasamy, David Blaauw, and Reetuparna Das. 2017. Compute caches. In . IEEE, 481–492.
[2] Amogh Agrawal, Akhilesh Jaiswal, Chankyu Lee, and Kaushik Roy. 2018. X-SRAM: Enabling in-memory Boolean computations in CMOS static random access memories. IEEE Transactions on Circuits and Systems I: Regular Papers 65, 12 (2018), 4219–4232.
[3] Rotem Ben-Hur, Ronny Ronen, Ameer Haj-Ali, Debjyoti Bhattacharjee, Adi Eliahu, Natan Peled, and Shahar Kvatinsky. 2019. SIMPLER MAGIC: Synthesis and Mapping of In-Memory Logic Executed in a Single Row to Improve Throughput. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2019).
[4] Debjyoti Bhattacharjee, Luca Amarú, and Anupam Chattopadhyay. 2018. Technology-aware logic synthesis for ReRAM based in-memory computing. In DATE. 1435–1440.
[5] Debjyoti Bhattacharjee, Rajeswari Devadoss, and Anupam Chattopadhyay. 2017. ReVAMP: ReRAM based VLIW architecture for in-memory computing. In DATE. 782–787.
[6] Debjyoti Bhattacharjee, Arvind Easwaran, and Anupam Chattopadhyay. 2017. Area-constrained Technology Mapping for In-Memory Computing using ReRAM devices. In .
[7] Debjyoti Bhattacharjee, Yaswanth Tavva, Arvind Easwaran, and Anupam Chattopadhyay. 2020. Crossbar-constrained technology mapping for ReRAM based in-memory computing. IEEE Trans. Comput. 69, 5 (2020), 734–748.
[8] E. Linn, R. Rosezin, S. Tappertzhofen, U. Böttger, and R. Waser. 2012. Beyond von Neumann-logic operations in passive crossbar arrays alongside memory operations. Nanotechnology 23, 30 (2012). https://doi.org/10.1088/0957-4484/23/30/305205
[9] P. E. Gaillardon, L. Amarú, A. Siemon, E. Linn, R. Waser, A. Chattopadhyay, and G. De Micheli. 2016. The Programmable Logic-in-Memory (PLiM) computer. In DATE. 427–432.
[10] Rahul Gharpinde, Phrangboklang Lynton Thangkhiew, Kamalika Datta, and Indranil Sengupta. 2017. A scalable in-memory logic synthesis approach using memristor crossbar. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 26, 2 (2017), 355–366.
[11] Ameer Haj-Ali, Rotem Ben-Hur, Nimrod Wald, Ronny Ronen, and Shahar Kvatinsky. 2018. Not in name alone: A memristive memory processing unit for real in-memory processing. IEEE Micro 38, 5 (2018), 13–21.
[12] Said Hamdioui, Lei Xie, Hoang Anh Du Nguyen, Mottaqiallah Taouil, Koen Bertels, Henk Corporaal, Hailong Jiao, Francky Catthoor, Dirk Wouters, Eike Linn, et al. 2015. Memristor based computation-in-memory architecture for data-intensive applications. In DATE. EDA Consortium, 1718–1725.
[13] Mark C Hansen, Hakan Yalcin, and John P Hayes. 1999. Unveiling the ISCAS-85 benchmarks: A case study in reverse engineering. IEEE Design & Test of Computers 16, 3 (1999), 72–80.
[14] Rotem Ben Hur, Nimrod Wald, Nishil Talati, and Shahar Kvatinsky. 2017. SIMPLE MAGIC: synthesis and in-memory mapping of logic execution for memristor-aided logic. In Proceedings of the 36th International Conference on Computer-Aided Design. 225–232.
[15] Sandeep Kaur Kingra, Vivek Parmar, Che-Chia Chang, Boris Hudec, Tuo-Hung Hou, and Manan Suri. 2020. SLIM: Simultaneous Logic-in-Memory Computing Exploiting Bilayer Analog OxRAM Devices. Scientific Reports 10, 1 (2020), 1–14.
[16] Shahar Kvatinsky, Dmitry Belousov, Slavik Liman, Guy Satat, Nimrod Wald, Eby G Friedman, Avinoam Kolodny, and Uri C Weiser. 2014. MAGIC—Memristor-aided logic. IEEE Transactions on Circuits and Systems II: Express Briefs 61, 11 (2014), 895–899.
[17] Chia-Fu Lee, Hon-Jarn Lin, Chiu-Wang Lien, Yu-Der Chih, and Jonathan Chang. 2017. A 1.4 Mb 40-nm embedded ReRAM macro with 0.07 µm² bit cell, 2.7 mA/100 MHz low-power read and hybrid write verify for high endurance application. In . IEEE, 9–12.
[18] Eero Lehtonen and Mika Laiho. 2009. Stateful implication logic with memristors. In NanoArch. IEEE Computer Society, 33–36.
[19] Eero Lehtonen, JH Poikonen, and Mika Laiho. 2010. Two memristors suffice to compute all Boolean functions. Electronics Letters 46, 3 (2010), 239–240.
[20] Ardavan Pedram, Stephen Richardson, Mark Horowitz, Sameh Galal, and Shahar Kvatinsky. 2016. Dark memory and accelerator-rich system optimization in the dark silicon era. IEEE Design & Test 34, 2 (2016), 39–50.
[21] John Reuben, Rotem Ben-Hur, Nimrod Wald, Nishil Talati, Ameer Haj Ali, Pierre-Emmanuel Gaillardon, and Shahar Kvatinsky. 2017. Memristive logic: A framework for evaluation and comparison. In . IEEE, 1–8.
[22] Vivek Seshadri, Donghyuk Lee, Thomas Mullins, Hasan Hassan, Amirali Boroumand, Jeremie Kim, Michael A Kozuch, Onur Mutlu, Phillip B Gibbons, and Todd C Mowry. 2017. Ambit: In-memory accelerator for bulk bitwise operations using commodity DRAM technology. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. 273–287.
[23] S. Shirinzadeh, M. Soeken, P. Gaillardon, and R. Drechsler. 2018. Logic Synthesis for RRAM-Based In-Memory Computing. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37, 7 (2018), 1422–1435.
[24] Mathias Soeken and Anupam Chattopadhyay. 2016. Unlocking efficiency and scalability of reversible logic synthesis using conventional logic synthesis. IEEE Transactions on Nanotechnology 15, 4 (2016), 635–650.
[27] Valerio Tenace, Roberto G Rizzo, Debjyoti Bhattacharjee, Anupam Chattopadhyay, and Andrea Calimera. 2019. SAID: A Supergate-Aided Logic Synthesis Flow for Memristive Crossbars. In . IEEE, 372–377.
[28] Phrangboklang L Thangkhiew and Kamalika Datta. 2018. Scalable in-memory mapping of Boolean functions in memristive crossbar array using simulated annealing. Journal of Systems Architecture 89 (2018), 49–59.
[29] Xiaoyong Xue, Wenxiang Jian, Jianguo Yang, Fanjie Xiao, Gang Chen, Shuliu Xu, Yufeng Xie, Yinyin Lin, Ryan Huang, Qingtian Zou, et al. 2013. A 0.13 µm 8 Mb Logic-Based CuxSiyO ReRAM With Self-Adaptive Operation for Yield Enhancement and Power Reduction. IEEE Journal of Solid-State Circuits 48, 5 (2013), 1315–1322.
[30] Dev Narayan Yadav, Phrangboklang L Thangkhiew, and Kamalika Datta. 2019. Look-ahead mapping of Boolean functions in memristive crossbar array. Integration 64 (2019), 152–162.
[31] Alwin Zulehner, Kamalika Datta, Indranil Sengupta, and Robert Wille. 2019. A staircase structure for scalable and efficient synthesis of memristor-aided logic. In