A Graph Partitioning Algorithm with Application in Synthesizing Single Flux Quantum Logic Circuits

Abstract

In this paper, a new graph partitioning problem is introduced. The depth of each part is constrained, i.e., the node count in the longest path of the corresponding sub-graph is no more than a predetermined positive integer value $p$. An additional constraint is enforced such that each part contains only nodes selected from consecutive levels in the graph. The problem is therefore transformed into a Depth-bounded Levelized Graph Partitioning (DLGP) problem, which is solved optimally using a dynamic programming algorithm. As an example application, we show that DLGP can effectively generate timing-correct circuit solutions for Single Flux Quantum (SFQ) logic, which is a magnetic-pulse-based, gate-level pipelined superconductive computing fabric. Experimental results confirm that DLGP generates circuits with considerably lower path balancing overheads compared with a baseline full-path-balancing approach. For example, the balancing overhead (a critical quality metric) for the SFQ circuit realization, in terms of D-Flip-Flop count, is reduced by 3.61 times on average for 10 benchmark circuits, given $p = 5$.


Ghasem Pasandi and Massoud Pedram
Department of Electrical Engineering-Systems, University of Southern California, Los Angeles, CA 90089.

I. INTRODUCTION

Graph partitioning (GP) consists of dividing the nodes of a graph into smaller (typically equal-size) parts to minimize a cost function subject to some constraints. GP plays an important role in many different applications including parallel processing, processing complex networks, image processing, and VLSI design [1]. Partitioning large VLSI circuits has an important impact on placement, routing, and testability of those circuits. A good partitioning can result in lower total wire-length, which has a direct impact on reducing the critical path delay of the circuit and the total area of the chip. Moreover, in hardware simulation and test, a good partitioning solution can reduce the number of multiplexers required for passing inter-block signals to the bus architecture of the hardware simulator [2]. For GP problems in VLSI, a circuit graph is generated by considering gates or modules as nodes of the graph and the wires connecting them as its edges.

In this paper, we present the Depth-bounded Levelized Graph Partitioning (DLGP) problem as a new GP problem and provide an optimal solution for it using a Dynamic Programming (DP) algorithm. Thanks to DLGP, an important problem in the design flow of superconductive Single Flux Quantum (SFQ) circuits is addressed. SFQ gates, with picosecond-scale switching delay and very low energy consumption per switching event, are considered potential candidates for achieving super high-performance and ultra energy-efficient systems [3]–[5]. Despite the superiority of SFQ gates in achieving super fast and ultra energy-efficient circuit realizations, there are a few challenges in their design flow.

In an SFQ design most circuit elements (including all logic cells) are clocked elements, i.e., each logic cell becomes a pipeline stage. For correct operation of an SFQ logic cell, it is necessary for the different inputs of the cell to arrive at the cell input in the same clock cycle. Hence, when different inputs traverse paths of different lengths, they must be explicitly path balanced using D-flip-flops (DFFs). In this paper, we address path balancing overhead as an important challenge in the SFQ circuit design process.

II. PRELIMINARIES

A. Background on Graph Partitioning

Definition: Given a graph $G = (V, E)$ with non-negative edge weights $w : E \to \mathbb{R}^{+}$ and a size $s(v)$ for each vertex $v \in V$, the graph partitioning problem (GPP) is defined as dividing the set $V$ into subsets $V_1, V_2, \dots, V_K$ such that Eqs. 1 and 2 hold and an objective function (described below) is minimized.

$$V_1 \cup V_2 \cup \dots \cup V_K = V \tag{1}$$

$$V_i \cap V_j = \emptyset \quad \forall\, i \neq j \tag{2}$$

The above problem is also called K-way partitioning. The size of part $i$ is denoted by $|V_i|$, i.e., $|V_i| = \sum_{v \in V_i} s(v)$. The bounded-size GPP is defined as a GPP in which the size of the $i$-th part is bounded by $B_i$ ($|V_i| \le B_i$). A special case is balanced partitioning, where the sizes of all parts should be equal up to a correction factor. More precisely, Eq. 3 should hold:

$$\forall\, i \in \{1, 2, \dots, K\}: \quad |V_i| \le \frac{1 + \epsilon}{K} \times |V| \tag{3}$$

This problem is commonly denoted as a $(K, \epsilon)$-balanced partitioning problem. If $\epsilon = 0$, the problem is called perfect partitioning, and the special case of $K = 2$ and $\epsilon = 0$ is called minimum bisectioning.

As mentioned before, in GPP an objective function should be minimized. The most commonly used objective function for GPP is the total cut size, which is defined by the following equation:

$$\sum_{e \in C} w(e), \qquad C = \{\, e = \langle u, v \rangle \in E : u \in V_i,\ v \in V_j,\ i \neq j \,\} \tag{4}$$

The level of a node $n$ in a graph $G$ is defined as the length, in terms of node count, of the longest path from the primary inputs of $G$ to node $n$; if the nodes of the graph are logic gates, it is called the logic level. The depth of a graph is defined as the highest level among all nodes in the graph. The depth of a part in a partitioning problem is the difference between the highest and the lowest levels among the nodes of this part, plus 1.

Fig. 1: Schematic of an SFQ inverter and its waveforms.
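The level and depth definitions of Section II-A can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `node_levels` and `part_depth` and the small example graph are assumptions made here and do not come from the paper.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def node_levels(nodes, edges):
    """Level of each node: node count on the longest path from a primary input.

    Nodes without fanin (primary inputs) get level 1. The graph must be a DAG.
    """
    preds = {n: set() for n in nodes}
    for u, v in edges:
        preds[v].add(u)
    level = {}
    # static_order() yields every node only after all of its predecessors.
    for n in TopologicalSorter(preds).static_order():
        level[n] = 1 + max((level[u] for u in preds[n]), default=0)
    return level


def part_depth(part, level):
    """Depth of a part: highest level minus lowest level of its nodes, plus 1."""
    levels = [level[n] for n in part]
    return max(levels) - min(levels) + 1


# Hypothetical 4-node example: a and b are primary inputs.
nodes = ["a", "b", "c", "d"]
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("a", "d")]
level = node_levels(nodes, edges)
graph_depth = max(level.values())  # depth of the graph = highest level
```

In this example the edge from `a` to `d` skips one level, which is exactly the situation that later calls for path balancing.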

B. Background on SFQ

SFQ gates are pulse-based: the presence and absence of a pulse are interpreted as “1” and “0”, respectively. A pulse is a single quantum of magnetic flux ($\Phi_0 = h/2e = 2.07\ \mathrm{mV \times ps}$) with a duration of a few ps and an amplitude of a few mV. In the following, some key properties of SFQ circuits are explained.

1) Gate-level pipeline:

SFQ gates (except for confluence buffers, splitters, I/O cells, and T-Flip-Flops) need to receive a clock signal, and their operation is synchronized by the clock. Fig. 1 shows the circuit diagram of an SFQ inverter and the corresponding waveforms illustrating its functionality. As seen, after the clock pulse arrives, when there is no input pulse (which means “0”), a pulse is generated at the output of the gate, representing a “1”. On the other hand, when there is an input pulse, no pulse is generated at the output, meaning a “0”.

2) Path Balancing:

In standard SFQ circuit design, to guarantee correct operation, all fanins of a gate should have the same logic level. Otherwise, path balancing DFFs should be inserted into the shorter paths. This is called path balancing. The following example shows the necessity of path balancing in SFQ circuits.

Example 1: Suppose that there is a digital signal $a_i$ = 1010...10 and we want to AND it with the inverse of another digital signal $b_i$ = 0101...01. The correct output is x1010...10. The first bit in the output is not valid because, in the first clock cycle, the second input of the AND gate ($in_2$) is unknown. Without path balancing, the values generated at the output of the AND gate will be x0000...00, which is not correct (Fig. 2a). The error occurs because the signals on $in_2$ are one level behind the signals on $in_1$. By inserting a path balancing DFF, all fanins of the AND gate will have the same logic level. In the path-balanced circuit, as shown in Fig. 2b, the correct sequence of bits is generated at the output of the AND gate.
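Example 1 can be reproduced with a toy cycle-level model. The sketch below is an assumption-laden illustration, not the authors' simulator: it models only which values reach the two AND inputs in each clock cycle, treats the clocked inverter and the optional DFF as one-cycle delays, and ignores the AND gate's own one-cycle latency because it shifts both cases identically. The function name `and_outputs` is hypothetical.

```python
def and_outputs(a, b, balanced):
    """AND of a with NOT(b) in a gate-level pipeline; 'x' marks an unknown value.

    a, b     : lists of bits, one per clock cycle
    balanced : if True, a path balancing DFF delays a by one cycle
    """
    out = []
    for t in range(len(a)):
        # The clocked inverter delays b by one cycle, so in2 is unknown at t = 0.
        in2 = None if t == 0 else 1 - b[t - 1]
        if balanced:
            # The DFF delays a by one cycle as well, so both fanins are aligned.
            in1 = None if t == 0 else a[t - 1]
        else:
            # Without the DFF, a arrives one cycle ahead of NOT(b).
            in1 = a[t]
        out.append("x" if in1 is None or in2 is None else str(in1 & in2))
    return "".join(out)


a = [1, 0, 1, 0, 1, 0]
b = [0, 1, 0, 1, 0, 1]
unbalanced = and_outputs(a, b, balanced=False)
balanced = and_outputs(a, b, balanced=True)
```

With these six-cycle inputs the unbalanced circuit yields only zeros after the unknown first bit, while the balanced circuit yields the alternating pattern described in the example.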

3) D-Flip-Flops (DFFs) in SFQ:

There are two types of DFFs in SFQ: Destructive Read Out (DRO) and Non-Destructive Read Out (NDRO). In a DRO DFF, reading the internal data destroys it, and it cannot be read again until another value is written into the DFF. In an NDRO DFF, the read operation does not destroy the stored data.

III. PRIOR WORK

The balanced min-cut K-way partitioning problem is NP-complete [6], [7]. It remains NP-complete even for $K = 2$ with identical vertex sizes and unit edge weights [6], [8]. The first well-known heuristic for solving the 2-way balanced partitioning problem is the Kernighan-Lin (K-L) heuristic [9]. In K-L, a pairwise interchange process is performed by exchanging the vertex pairs that yield the largest decrease in the cut size. The exchanged vertices are locked, and the process continues until all vertices are locked. A well-known modification of the K-L heuristic was presented by Fiduccia and Mattheyses in [10]. The Fiduccia-Mattheyses (F-M) heuristic handles unbalanced parts, supports hyper-edges, and has a run-time that is linear in the total number of circuit pins. After F-M, many other papers presented effective partitioning heuristics using simulated annealing [11], multilevel approaches [12], network flow [13], [14], spectral methods [15], [16], or a unified approach combining a few methods [14], [17].

Fig. 2: Gate-level schematic of Example 1 showing the necessity of path balancing for correct operation in SFQ circuits: (a) without path balancing, the output is x0000...00 (error); (b) with a path balancing DFF, the output is x1010...10 (correct).

In [18], a module labeling and clustering algorithm is presented, which is known as Lawler's clustering algorithm. It gives an optimal solution for tree clustering to minimize delay under constraints on the cluster size and the maximum pin count of each cluster. A unit delay model is used for each cluster, and it is assumed that interconnect delays are zero. Lawler's algorithm provides an optimal solution for DAGs if replication of modules is allowed. Rajmohan et al. presented a clustering algorithm for minimizing the delay in combinational circuits [19]. Unlike Lawler's algorithm, [19] uses a general delay model and gives an optimal solution subject to area capacity constraints. Good surveys on GP and its applications are given in [1], [2], [20], [21].

Different from other GPPs, in the Depth-bounded Levelized Graph Partitioning (DLGP) problem there is a constraint on the depth of each part, while balancing the parts in terms of total size is not of interest. Under the standard definition, DLGP is therefore an unbalanced partitioning problem. However, since in DLGP the depth of every part is bounded by the same value, it can be called depth-balanced or depth-limited partitioning. In this context, the standard balanced partitioning problem can be called the size-balanced partitioning problem.

IV. PROBLEM FORMULATION AND PROPOSED ALGORITHM

Depth-bounded Levelized Graph Partitioning (DLGP) problem definition: Given a directed acyclic graph $G = (V, E)$, a mapping function $\Lambda$ which specifies the level of each node in $V$ to be between 1 and a maximum value $L$ (i.e., the longest node distance from any source node of $G$), and a positive integer $p$, partition the set $V$ into $K$ parts $V_1, V_2, \dots, V_K$, each of which gives rise to an induced sub-graph $G_1, G_2, \dots, G_K$, such that (i) the depth of each sub-graph is no more than $p$, and (ii) if a node of level $l$ is included in some part $V_k$, then all nodes of level $l$ also belong to part $V_k$. Furthermore, the total cut size (TCS), as defined below, is minimized:

$$TCS = \sum_{n=1}^{L-1} cut\_size(\langle Pre(n), Post(n) \rangle) \tag{5}$$

where $Pre(n)$ denotes all nodes $i$ in $G$ belonging to parts $V_{n'}$ such that $n' < n$, and $Post(n)$ denotes all nodes $j$ in $G$ belonging to parts $V_{n''}$ such that $n'' \ge n$. $\langle Pre(n), Post(n) \rangle$ denotes the edge separator between levels $n - 1$ and $n$ in $G$, i.e., the set of edges in $G$ which originate at any node $i$ in $Pre(n)$ and terminate at any node $j$ in $Post(n)$. $cut\_size(\langle Pre(n), Post(n) \rangle)$ calculates the number of edges in $G$ that exist between the $Pre(n)$ and $Post(n)$ sets.

Next, we define a weighted directed chain graph $C = (U, F, w)$ with nodes labeled $1 \dots L$, where $L$ is equal to the depth of graph $G = (V, E)$. Each such node represents all nodes of $G$ which are at the same level. There is a directed edge $uv$ in $F$ between nodes $u$ and $v$ only when $v = u + 1$. The weight of the incoming edge to node $v$ accounts for the total number of edges connected to any node with level $\ge v$ from nodes with level $< v$. More precisely, if $v = u + 1$, the weight $w_{uv}$ of the edge $uv$ is defined as the number of directed edges in $G$ that connect any node in $\{Pre(u), u\}$ to any node in $Post(v)$. If $v \ne u + 1$, $w_{uv} = 0$.

Note that the above definition and weight assignment function work equally well for hyper-graphs and hyper-edges (simply add “hyper” before any occurrence of “graph” or “edge”). A directed hyper-edge is one with a distinguished connected node called the source node, thereby establishing a clear sense of directionality between the source node and all other sink nodes connected by the hyper-edge.

Depth-bounded Chain Graph Partitioning (DCGP) problem definition: Given a weighted chain graph $C = (U, F, w)$ and a positive integer $p$, partition the set $U$ into $K$ parts $U_1, U_2, \dots, U_K$ such that (i) the depth of each part is no more than $p$, and (ii) the total cut weight (TCW), as defined below, is minimized:

$$TCW = \sum_{k=1}^{K-1} cut\_weight(\langle Pre(k), Post(k) \rangle) \tag{6}$$

where $Pre(k)$ denotes all nodes $i$ belonging to parts $U_{k'}$ in $C$ such that $k' < k$, and $Post(k)$ denotes all nodes $j$ belonging to parts $U_{k''}$ in $C$ such that $k'' \ge k$. $\langle Pre(k), Post(k) \rangle$ denotes the edge separator of $Pre(k)$ and $Post(k)$ in $C$, i.e., the set of edges in $C$ which originate at any node $k'$ in $Pre(k)$ and terminate at any node $k''$ in $Post(k)$. $cut\_weight(\langle Pre(k), Post(k) \rangle)$ is calculated as the sum of the edge weights $w_{uv}$ over every distinct pair of nodes $u \in Pre(k)$ and $v \in Post(k)$.

Lemma 1: DCGP and DLGP are equivalent problems, i.e., solving the DCGP problem yields the solution of the DLGP problem and vice versa.

Proof: It is enough to show that a solution to the DCGP problem can be transformed into a valid solution for the DLGP problem and vice versa. The proof is straightforward and follows easily from the way the chain graph and the $Pre$ and $Post$ sets are defined. It is omitted here to save space. ∎
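The chain-graph construction defined in this section can be sketched as follows. This is a minimal sketch under stated assumptions: node levels are supplied as a dictionary, the names `chain_weights` and `chain_weights_hyper` are hypothetical, and the small graph is an illustration rather than one of the paper's figures. Each chain edge from level j to level j+1 receives the number of (hyper-)edges of the original graph that start at level j or below and end above level j.

```python
def chain_weights(edges, level, depth):
    """w[j-1] is the weight of chain edge j -> j+1, counting regular edges."""
    w = [0] * (depth - 1)
    for u, v in edges:
        # An edge from level[u] to level[v] crosses every boundary in between.
        for j in range(level[u], level[v]):
            w[j - 1] += 1
    return w


def chain_weights_hyper(hyper_edges, level, depth):
    """Same as chain_weights, but a hyper-edge (source, sinks) counts once per boundary."""
    w = [0] * (depth - 1)
    for source, sinks in hyper_edges:
        farthest = max(level[s] for s in sinks)
        for j in range(level[source], farthest):
            w[j - 1] += 1
    return w


# Hypothetical example with levels a=1, b=1, c=2, d=3.
level = {"a": 1, "b": 1, "c": 2, "d": 3}
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("a", "d")]
hyper_edges = [("a", ["c", "d"]), ("b", ["c"]), ("c", ["d"])]

w_regular = chain_weights(edges, level, depth=3)
w_hyper = chain_weights_hyper(hyper_edges, level, depth=3)
```

In this example the fanout of `a` contributes two regular edges across the first boundary but only one hyper-edge, which is why the two weightings can differ.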

Note that when $p$ is equal to or larger than the depth of graph $G$, there will be only one part, equal to $G$ itself, and the problem is trivial.

Lemma 2: The DCGP problem can be solved optimally using the DP algorithm.

To use DP for finding the optimal solution of the DCGP problem, we define $O(i)$ as the partitioning solution for a sub-graph $G_i$ of the original graph $G$ which minimizes the total cut weight as defined by Eq. 6. Notice that $G_i$ is the induced graph obtained from $G$ by including all nodes of $G$ with levels less than or equal to $i$. $O(L)$ denotes the optimal solution of our problem. We initialize $OPT(i) = 0$ for all $1 \le i \le p$. Next, the value of the optimal solution, $OPT(i)$, which is defined as the minimum value of $TCW$ for the induced graph $G_i$, is calculated recursively as follows:

$$OPT(i) = \min_{1 \le q \le p} \{\, OPT(i - q) + cut\_weight(\langle Pre, Post \rangle) \,\} \tag{7}$$

Here the cut-weight term is that of the separator between levels $i - q$ and $i - q + 1$, i.e., the boundary that makes levels $i - q + 1, \dots, i$ the last part of $G_i$.

Proof of optimality: It should be shown that the optimal solution of problem instance $O(i)$ is built from optimal solutions of its sub-problems. For this purpose, we argue as follows: suppose that the $i$-th instance of the problem, with optimal solution $O(i)$, has a sub-problem $i - q$ with optimal solution $O(i - q)$ and optimal value $OPT(i - q) = M$. Suppose that $O(i)$ is built from a solution of the $(i - q)$-th sub-problem with value $M' > M$. Call this solution $O'(i - q)$. Now we can generate another solution for the $i$-th instance of the problem by replacing $O'(i - q)$ with $O(i - q)$. Since $M < M'$, the new solution for the $i$-th instance is better than the first one, which is a contradiction, because the first solution was supposed to be optimal. Therefore, the optimal solution of the $i$-th instance of the problem is built from the optimal solutions of its sub-problems. ∎

Theorem: The DLGP problem can be solved optimally.
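The recurrence of Eq. 7, together with a backward trace over the chosen boundaries, can be sketched in Python as shown below. This is a hedged sketch rather than the authors' implementation: the function name `dcgp`, the list-based encoding of the chain weights, and the example weights are assumptions made here for illustration.

```python
def dcgp(weights, p):
    """Depth-bounded chain partitioning by dynamic programming.

    weights : weights[j-1] is the weight of chain edge j -> j+1 (levels 1..L)
    p       : maximum number of consecutive levels allowed in one part
    Returns (minimum total cut weight, sorted list of levels after which a cut is placed).
    """
    L = len(weights) + 1
    opt = [float("inf")] * (L + 1)   # opt[i] = OPT(i) for the first i levels
    last_cut = [0] * (L + 1)         # last_cut[i] = level after which the last cut lies (0: none)
    for i in range(1, L + 1):
        if i <= p:
            opt[i] = 0               # the first i levels fit into a single part
            continue
        for q in range(1, p + 1):
            j = i - q                # last part covers levels j+1 .. i
            cand = opt[j] + weights[j - 1]
            if cand < opt[i]:
                opt[i], last_cut[i] = cand, j
    # Trace back from the full problem to the boundary sub-problems.
    cuts, i = [], L
    while last_cut[i] > 0:
        cuts.append(last_cut[i])
        i = last_cut[i]
    return opt[L], sorted(cuts)


# Illustrative chain with 5 levels; the weights are made up for this sketch.
best_value, selected_levels = dcgp([6, 3, 2, 3], p=2)
```

For these illustrative weights and `p=2`, the cheapest admissible choice cuts after levels 2 and 3, giving three parts of depth 2, 1, and 2. When `p` is at least the number of levels, no cut is needed and the total cut weight is 0.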

Proof: Using Lemma 1 and Lemma 2, the proof is straightforward.

After finding the optimal solution, the parts can be generated by tracing the $O(L)$ solution back as follows. Generate an empty set of selected levels $N_{sel}$. Add the indices of the sub-problems of $O(L)$ to $N_{sel}$, i.e., if $j = i - q$ yields the minimum value for $O(L)$ in Eq. 7, add $j$ to $N_{sel}$. Repeat these steps for $O(j)$, and trace all the way back to reach the boundary sub-problems. At the end, we will have $N_{sel} = \{m_1, m_2, \dots, m_{K-1}\}$. Having $N_{sel}$, the $i$-th subset of nodes, $V_i$, corresponding to the $i$-th part in the DLGP problem is obtained using the following equation:

$$V_i = \begin{cases} \{ v \in V \mid 0 < \Lambda(v) \le m_1 \} & : i = 1 \\ \{ v \in V \mid m_{i-1} < \Lambda(v) \le m_i \} & : 1 < i < K \\ \{ v \in V \mid \Lambda(v) > m_{K-1} \} & : i = K \end{cases} \tag{8}$$

in which $\Lambda(v)$ returns the level of node $v$. Algorithm 1 shows the pseudo code of DLGP.

Algorithm 1: DLGP
Input: $G = (V, E)$; $p$: constraint on depth; a mapping function $\Lambda$ which returns the level of a node in $G$.
Output: An optimal set $P = \{V_1, V_2, \dots, V_K\}$ of parts.
1: $L$ = Compute_Graph_NodeDepth($G$)
2: if $p \ge L$ then
3:     return $P = \{V\}$
4: Generate the weighted directed chain graph $C = (U, F, w)$
5: for $i = 1$; $i \le p$; $i$++ do
6:     $OPT(i) = 0$
7: for $i = 1$; $i \le L$; $i$++ do
8:     Find $O(i)$ and calculate its value, $OPT(i)$, using Eq. 7
9: Find $N_{sel} = \{m_1, m_2, \dots, m_{K-1}\}$ by tracing back from the $O(L)$ solution
10: for $i = 1$; $i \le K$; $i$++ do
11:     Find $V_i$ using Eq. 8
12: return $P = \{V_1, V_2, \dots, V_K\}$

The complexity of line 1 and of line 4 is $O(m + n)$, where $n$ is the node count and $m$ is the edge count. The complexity of lines 7-8 is $O(p \times L)$ based on Eq. 7. The complexity of lines 10-11 is $O(n)$, because we go through all nodes only once and put each of them in a part based on Eq. 8. Since $L \le n$ and $p$ is a small constant in practice, the overall complexity of the DLGP algorithm is $O(m + n)$.

Example 2: For the graph shown in Fig. 3a, with a given depth bound $p$ and using hyper-edges for calculating weights, the corresponding weighted directed chain graph is as shown in Fig. 3b. Using the DLGP algorithm, two levels are selected in $N_{sel}$, and the optimal solution consists of $K = 3$ parts $V_1$, $V_2$, and $V_3$ containing four, two, and four nodes, respectively. Note that since hyper-edges are used, the edge weights of the weighted directed chain graph are those shown in Fig. 3b, which differ from the weights obtained when regular edges are used in the weight calculations.

V. REDUCING PATH BALANCING OVERHEAD IN SFQ CIRCUITS

Evaluation of SFQ gates is destructive with respect to any internal state (loop current state) of the gate and any incoming input pulses. In other words, after an SFQ gate receives a clock pulse to produce its output, any stored internal state of the gate is destroyed and the input pulse is consumed. For example, if a 2-input AND gate receives a 1 pulse on its $in_1$ before the clock signal arrives, it will store the said input pulse as a persistent current in one of its internal loops. Next, if the gate receives a second 1 pulse on its $in_2$, again before the clock comes, it will store this input value in a second internal loop as a persistent current. Finally, when the clock input to the gate arrives, both loop currents will reset (revert back to the other direction of current flow) and an output pulse is produced to signify the 1 output of the AND gate. Now consider a situation in which $in_1$ arrives in clock cycle 1 whereas $in_2$ arrives in clock cycle 2. One would expect that the AND gate will produce a 0 at the end of clock cycle 1 and a 1 at the end of clock cycle 2. However, this is not the case. In SFQ logic, the AND gate will receive and consume the input pulse on $in_1$ during clock cycle 1, producing a 0 output, and then it will receive and consume the input pulse on $in_2$ in cycle 2, again producing a 0 output. This is precisely why the full path balancing method is employed to make sure that the AND gate will produce the correct 1 pulse output at the end of clock cycle 2.

Fig. 3: (a) Graph of Example 2, (b) corresponding weighted directed chain graph, with edge weights 6, 3, 2, and 3. Cuts $C_1$ and $C_2$ generate three parts $V_1$, $V_2$, and $V_3$ while minimizing the inter-part net weight.

Our key observation is that, to avoid path balancing DFFs, all we have to do is make sure that the producer of the 1 pulse on $in_1$ produces that same pulse in both clock cycles 1 and 2. The AND gate will then produce a wrong value of 0 at the end of clock cycle 1 but the correct value of 1 at the end of clock cycle 2. So, as long as we launch new data toward the AND gate with a slow clock whose frequency is half that of the fast clock used to clock the AND gate, the AND gate will produce the correct output at multiples of the slow clock period.

To the best of our knowledge, this is the very first paper that makes this observation and uses it to effectively eliminate path balancing DFFs inside an SFQ logic circuit, although the method comes at the expense of using two different clocks (micro and macro clocks) and a number of NDRO (output-replicating or repeating) DFFs. These NDRO DFFs are read by the micro clock and written by the macro clock. Since they are read by the micro clock, in each cycle of the micro clock the correct pulses will be re-generated and put on the primary inputs of each part. Note that DRO DFFs cannot be used here because the read operation in DRO DFFs is destructive; hence, they cannot re-generate the correct pulses $p$ times.

As explained above and in Section II-B2, for correct operation of SFQ circuits, full path balancing is required. One way of addressing the path balancing problem is to add as many DFFs as required to remove any differences among the levels of the inputs to any SFQ gate. This approach is called Full Path Balancing (FPB).
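As a rough illustration of the cost of FPB, the number of DFFs it needs can be estimated from node levels: an edge whose endpoints differ by more than one level needs one DFF per skipped level. The sketch below is a simplification made here, not the paper's FPB heuristic: it assumes that DFF chains are not shared among the fanouts of a node and it ignores the balancing of primary outputs. The name `fpb_dff_count` and the example graph are hypothetical.

```python
def fpb_dff_count(edges, level):
    """Naive full-path-balancing cost: one DFF for every level skipped by an edge."""
    return sum(level[v] - level[u] - 1 for u, v in edges)


# Hypothetical example: the edge a -> d skips level 2 and needs one DFF.
level = {"a": 1, "b": 1, "c": 2, "d": 3}
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("a", "d")]
dffs = fpb_dff_count(edges, level)
```

The count grows with every level that an edge skips, which is why deep circuits with many unbalanced fanins can end up with more DFFs than logic gates.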
In [22], it is suggested to apply the standard retiming algorithm [23] after a heuristic FPB algorithm to minimize the number of path-balancing DFFs (called FPB+retiming). In spite of this algorithm, our experience shows that FPB+retiming adds a large number of DFFs to the circuit, which can dominate the original gate count of the network, even considering the fact that the area cost of an SFQ DFF is somewhat less than the area cost of, say, a 2-input SFQ AND gate [24]. In this section, we show how the DLGP algorithm helps solve this problem.

Fig. 4: An example of (a) full path balancing (FPB), and (b) the DLGP-based dual clocking method (DCM) for SFQ circuits. FPB uses DRO DFFs whereas the DCM uses NDRO DFFs. FPB requires 9 DRO DFFs and DCM requires 5 NDRO DFFs.

As hinted earlier, we propose to use a fast micro clock and a slow macro clock, and to use the DLGP algorithm to minimize the aforementioned overheads of FPB. Thanks to the DLGP algorithm, it is possible to divide the graph of a given SFQ circuit into a few depth-limited parts and add NDRO DFFs only on the hyper-edges which are cut by the part boundaries. These DFFs pass values that go from one part to the next with the macro clock, while the gates inside each part operate with the micro clock. Since the DLGP algorithm guarantees the minimum total cut weight, the number of NDRO DFFs inserted in the circuit is minimized. Furthermore, since the NDRO DFFs placed at the inputs of each part are also clocked by the micro clock to continuously reproduce their outputs, there is no need to add any path balancing DFFs inside each part. The resulting SFQ circuit is thus functionally pipelined, allowing $K$ data instances to exist in the circuit at the same time, each being operated on in the corresponding part of the $K$-part circuit. The number of parts thus affects the total operational throughput of the circuit (when there are no pipeline stalls). Fig. 4 shows FPB and the DLGP-based Dual Clocking Method (DCM) for an example circuit. As seen, the DLGP-based DCM requires 4 fewer DFFs than FPB.
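The claim that repeated inputs remove the need for path balancing inside a part can be checked with a toy model. The sketch below rests on several assumptions made here for illustration: every gate is a one-cycle clocked element whose output starts at 0 (no pulse), the part inputs are held constant for all micro-clock cycles (as the NDRO DFFs would do), and the helper name `run_part` and the two-gate circuit are hypothetical.

```python
def run_part(gates, inputs, cycles):
    """Simulate one part for a number of micro-clock cycles.

    gates  : dict gate_name -> (function, list of fanin names)
    inputs : dict of part-input values, repeated unchanged in every cycle
    Returns the list of gate-output dictionaries, one per micro-clock cycle.
    """
    state = {g: 0 for g in gates}          # no pulses before the first clock
    history = []
    for _ in range(cycles):
        visible = {**inputs, **state}      # values seen by the gates in this cycle
        state = {g: f(*(visible[x] for x in fanins))
                 for g, (f, fanins) in gates.items()}
        history.append(dict(state))
    return history


def inv(x):
    return 1 - x


def and2(x, y):
    return x & y


# Unbalanced part of depth 2: out = AND(a, NOT(b)), with no DFF on the a path.
gates = {"n": (inv, ["b"]), "out": (and2, ["a", "n"])}
history = run_part(gates, {"a": 1, "b": 0}, cycles=2)
```

After the first micro-clock cycle the output is still wrong, because the inverter result has not yet reached the AND gate. After the second cycle, which equals the depth of this part, the output matches the combinational value.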
Note that although NDRO DFFs are more expensive than DRO DFFs, the reduction in total DFF count far outweighs this difference in element cost, as the experimental results show.

Fig. 5: Pulse-repeating gate: an SFQ gate consisting of an NDRO DFF, an AND gate, and two splitters that deliver the macro and micro clocks to these gates.

In our DLGP-based DCM, the depth of each part is at most $p$. Since the gates in each part are evaluated by the micro clock, and the micro clock is $p$ times faster than the macro clock, $p$ different values will hit the inputs of the DFFs at the inter-part boundaries between two consecutive edges of the macro clock. If every such value reached the NDRO DFFs, wrong values would be stored in the NDRO DFFs and passed to the next part. Indeed, we want only the value generated in the last cycle of the micro clock to be written into the NDRO DFFs, because only this value is valid. This issue can easily be addressed using the following pulse-repeating gate and by ensuring that the macro clock is synchronized with the micro clock but has a clock frequency which is $p$ times slower.

1) Pulse-Repeating Gate:

We have invented an SFQ pulse-repeating gate, shown in Fig. 5, to address the aforesaid challenges in the DLGP-based DCM. In this gate, an AND gate is added at the input of the NDRO DFF. One of the inputs of this AND gate is connected to the macro clock, which is “0” while the correct value has not yet been generated on the “in” port. Therefore, it passes a “0” to the input of the NDRO DFF, as desired. Only in the last cycle of the micro clock is the AND gate transparent, passing the valid value to the NDRO DFF. In addition, due to the use of the NDRO DFF, the inputs of each part are re-generated (repeated) at each cycle of the micro clock.

Before the input circuits are passed to the DLGP-based DCM, they are passed to the technology mapping engine of ABC [25] and its default optimizations are performed on them. After that, the mapped circuits are passed to the DLGP algorithm to find the optimum places for inserting pulse-repeating gates. After this step, splitter insertion and balancing of Primary Outputs (POs) are performed. Algorithm 2 shows the pseudo code for the DLGP-based DCM. In line 1, the given circuit is mapped using the technology mapping engine of ABC. In line 2, the optimal parts are determined. In line 3, the pulse-repeating gates are inserted on the hyper-edges which are cut. Line 4 takes care of balancing the POs, and finally, line 5 inserts the required splitters.

VI. EXPERIMENTAL RESULTS

We implemented the DLGP algorithm inside ABC. An SFQ library of gates as in [24] is used. This library consists of the following gates: and2 with 12 JJs, or2 with 8 JJs, xor2 with 8 JJs, DFF with 7 JJs, splitter with 3 JJs, JTL with 2 JJs, and not with 9 JJs. A few benchmark circuits from ISCAS [26], EPFL [27], and some other arithmetic circuits are chosen to test the effectiveness of the proposed algorithm in reducing the overhead of FPB.
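Given the per-cell JJ costs listed above, the JJ count of a mapped netlist is a weighted sum over its cell counts. The short sketch below uses the library values quoted in the text; the function name `total_jj` and the example cell counts are hypothetical and serve only as an illustration.

```python
# JJ cost per library cell, as listed in the text for the library of [24].
JJ_COST = {"and2": 12, "or2": 8, "xor2": 8, "DFF": 7, "splitter": 3, "JTL": 2, "not": 9}


def total_jj(cell_counts):
    """Total Josephson junction count for a dict mapping cell name -> instance count."""
    return sum(JJ_COST[cell] * count for cell, count in cell_counts.items())


# Made-up netlist: 2 AND gates, 1 inverter, 3 DFFs, and 4 splitters.
example_netlist = {"and2": 2, "not": 1, "DFF": 3, "splitter": 4}
example_jj = total_jj(example_netlist)
```

Because a DFF costs 7 JJs, every path balancing DFF that is avoided translates directly into a lower total JJ count.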

Algorithm 2: DLGP-based Dual Clocking Method
Input: a graph $G = (V, E)$ corresponding to the input circuit; $p$: constraint on depth.
Output: A timing-correct circuit represented by graph $G' = (V', E')$.
1: $G_m$ = Technology_Mapping($G$)
2: Parts = $\{V_1, V_2, \dots, V_k\}$ = DLGP($G_m$, $p$)
3: Pulse_Repeating($G_m$, Parts)
4: PO_Balancing($G_m$)
5: $G'$ = Splitter_Insertion($G_m$)
6: return $G'$

Tables I and II show experimental results for the DLGP-based DCM with two values of $p$, 10 and 5. For a better comparison, two baselines are considered: Baseline1 is FPB, and Baseline2 is the FPB+retiming algorithm. As seen, the DLGP-based DCM provides substantial savings in the total number of Josephson junctions (JJs) […]. The reported area and JJ count cover gates, DFFs, and splitters. The overhead of the AND gates (in pulse-repeating gates) and of the second clock in the DLGP-based DCM is included in the experimental results of DLGP. For example, the i[…] MCNC benchmark circuit consumes […]× and […]× fewer […] when $p$ = […]. For the same circuit, the DLGP-based DCM provides […]× and […]× improvements in total area, and […]× and […]× improvements in […]. For $p$ = […], the improvements are smaller than these values. For example, for the same circuit, the saving of […]× and […]× in area is reduced to […]× and […]×, and the saving in […] of […]× and […]× is reduced to […]× and […]×, all compared with Baseline1 and Baseline2, respectively.

On average over all 10 benchmark circuits, the savings in area, […], and […] for $p$ = […] are 89%, 77%, and […]×, respectively, over Baseline1, and 30%, 23%, and […]×, respectively, over Baseline2. The reason behind not seeing the huge saving of […]. For the c[…] benchmark circuit, the run-time is decreased by […]× and […]× when $p$ = […] compared with Baseline1 and Baseline2, respectively. The main reason for the larger run-time of the baselines is the requirement of inserting many DFFs and performing retiming, both of which are slow processes, especially for large benchmark circuits. We tried to extract experimental results for larger benchmark circuits (larger than voter, with 13758 nodes, 1002 IOs, 27516 edges, and 13758 cubes, which is already reported in Tables I and II) such as log2 and hypotenuse [27], but the memory of our system (64 GB RAM) was not enough for finishing the DFF insertion and retiming steps of Baseline1 and Baseline2, and the processes were killed after 30+ minutes.

To verify the correct operation of the circuits, we simulated a 2-bit Kogge-Stone Adder (KSA2) generated by the DLGP-based DCM given $p = 2$. Four sets of random values, as shown in Fig. 6, are applied to the inputs, and for all of them the correct outputs are generated.
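The expected outputs of the KSA2 simulation can be cross-checked with plain 2-bit addition. The sketch below checks arithmetic only and says nothing about SFQ timing. It takes the bit strings from the Fig. 6 caption and assumes that they belong to $a_0$, $a_1$, $b_0$, $b_1$, and $c_{in}$ in that order, with index 0 as the least significant bit; the function name `add2` is hypothetical.

```python
def add2(a0, a1, b0, b1, cin):
    """Bitwise-in-time 2-bit addition; each argument is a string with one bit per input set."""
    s0, s1, cout = [], [], []
    for t in range(len(cin)):
        a = int(a0[t]) + 2 * int(a1[t])
        b = int(b0[t]) + 2 * int(b1[t])
        total = a + b + int(cin[t])
        s0.append(str(total & 1))          # least significant sum bit
        s1.append(str((total >> 1) & 1))   # most significant sum bit
        cout.append(str(total >> 2))       # carry out
    return "".join(s0), "".join(s1), "".join(cout)


# Four input sets, one per character position (values from the Fig. 6 caption).
s0, s1, cout = add2(a0="1010", a1="0101", b0="0110", b1="1100", cin="1001")
```

Under this assignment of subscripts, the computed sum and carry strings agree with the outputs listed in the Fig. 6 caption.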
Note that since $p = 2$, the inputs are repeated and there are 2 copies of each input.

These savings come at one expense: the peak throughput of circuits generated by the DLGP-based DCM will be roughly $p\times$ lower than with the FPB method. This is because the effective frequency of these circuits is that of the macro clock, whereas the frequency of circuits generated by FPB is that of the micro clock. Note that the actual throughput is typically much less than the peak throughput (due to instruction data dependencies, program branches, etc.), so some throughput loss may be acceptable. In addition, due to the following property of SFQ circuits, the throughput loss will be less than $p\times$ after place-and-route: in SFQ circuits, interconnect delays are typically larger than gate delays; hence, the longest interconnect usually determines the worst-case delay. Therefore, since the DLGP-based DCM reduces the total gate count significantly, it helps reduce the length of the longest interconnect, resulting in a faster local clock frequency. This helps regain some of the throughput lost in the DLGP-based DCM. For example, the throughput loss for the ISCAS c432 circuit generated by the DLGP-based DCM ($p$ = […]) after place-and-route is reduced to […]× compared with FPB. A more advanced wire-routing method such as the one presented in [28] can help reduce this gap further.

VII. CONCLUSION

This paper introduces a new graph partitioning problem called Depth-bounded Levelized Graph Partitioning (DLGP). In DLGP, there is a depth constraint on the sub-graph of each part. We showed that by transforming the DLGP problem into a Depth-bounded Chain Graph Partitioning (DCGP) problem, an optimal solution which minimizes the total cut size is obtained using a Dynamic Programming algorithm. It is shown that the DLGP algorithm can be applied to SFQ circuits for reducing the path balancing overheads. Experimental results show that if the depth constraint is equal to 5, this overhead reduction is as high as […]× in terms of JJ count and […]× in terms of DFF count.

Fig. 6: Simulation results for a 2-bit Kogge-Stone adder (KSA2) generated by the DLGP-based DCM given $p = 2$. clk refers to the fast clock. Four sets of random inputs are applied: $a_0$ = 1010, $a_1$ = 0101, $b_0$ = 0110, $b_1$ = 1100, $c_{in}$ = 1001. The correct outputs are $S_0$ = 0101, $S_1$ = 0011, $C_{out}$ = 1100, which are generated every 2 clock cycles (fast clock).

TABLE I: Experimental results for DLGP-based DCM, Baseline1 (FPB), and Baseline2 (FPB + retiming). Area is in mm² and run-time is in sec. For the DLGP-based dual clocking method (DCM), two cases of $p$ (10 and 5) are considered.

TABLE II: Comparing DFF count for DLGP-based DCM and two baselines.

REFERENCES
[1] A. Buluç, H. Meyerhenke, I. Safro, P. Sanders, and C. Schulz, “Recent advances in graph partitioning,” in Algorithm Engineering. Springer, 2016, pp. 117–158.
[2] C. J. Alpert and A. B. Kahng, “Recent directions in netlist partitioning: a survey,” Integration, the VLSI Journal, vol. 19, no. 1-2, pp. 1–81, 1995.
[3] W. Chen, A. Rylyakov, V. Patel, J. Lukens, and K. Likharev, “Rapid single flux quantum T-flip flop operating up to 770 GHz,” IEEE Transactions on Applied Superconductivity, vol. 9, no. 2, pp. 3212–3215, 1999.
[4] D. S. Holmes, A. L. Ripple, and M. A. Manheimer, “Energy-efficient superconducting computing-power budgets and requirements,” IEEE Transactions on Applied Superconductivity, vol. 23, no. 3, pp. 1701610–1701610, 2013.
[5] K. Likharev and V. Semenov, “RSFQ logic/memory family: A new Josephson-junction technology for sub-terahertz-clock-frequency digital systems,” IEEE Transactions on Applied Superconductivity, vol. 50, no. 1, 1991.
[6] M. R. Garey, D. S. Johnson, and L. Stockmeyer, “Some simplified NP-complete problems,” in Proceedings of the Sixth Annual ACM Symposium on Theory of Computing. ACM, 1974, pp. 47–63.
[7] L. Hyafil and R. L. Rivest, Graph partitioning and constructing optimal decision trees are polynomial complete problems. IRIA. Laboratoire de Recherche en Informatique et Automatique, 1973.
[8] H. R. Lewis, “Computers and intractability. A guide to the theory of NP-completeness,” 1983.
[9] B. W. Kernighan and S. Lin, “An efficient heuristic procedure for partitioning graphs,” The Bell System Technical Journal, vol. 49, no. 2, pp. 291–307, 1970.
[10] C. M. Fiduccia and R. M. Mattheyses, “A linear-time heuristic for improving network partitions,” in

Papers on Twenty-ﬁve years of electronicdesign automation . ACM, 1988, pp. 241–247.[11] D. S. Johnson, C. R. Aragon, L. A. McGeoch, and C. Schevon,“Optimization by simulated annealing: an experimental evaluation; parti, graph partitioning,”

Operations research , vol. 37, no. 6, pp. 865–892,1989.[12] C. J. Alpert, J.-H. Huang, and A. B. Kahng, “Multilevel circuit parti-tioning,”

IEEE Transactions on Computer-Aided Design of IntegratedCircuits and Systems , vol. 17, no. 8, pp. 655–667, 1998.[13] H. Liu, K. Zhu, and D. Wong, “Circuit partitioning with complexresource constraints in fpgas,” in

Proceedings of the 1998 ACM/SIGDAsixth international symposium on Field programmable gate arrays .ACM, 1998, pp. 77–84.[14] H. Liu and D. Wong, “Network-ﬂow-based multiway partitioning witharea and pin constraints,”

IEEE Transactions on Computer-Aided Designof Integrated Circuits and Systems , vol. 17, no. 1, pp. 50–59, 1998.[15] L. Hagen and A. B. Kahng, “New spectral methods for ratio cut parti-tioning and clustering,”

IEEE transactions on computer-aided design ofintegrated circuits and systems , vol. 11, no. 9, pp. 1074–1085, 1992.[16] F. McSherry, “Spectral partitioning of random graphs,” in

Foundationsof Computer Science, 2001. Proceedings. 42nd IEEE Symposium on .IEEE, 2001, pp. 529–537.[17] S. Lafon and A. B. Lee, “Diffusion maps and coarse-graining: A uniﬁedframework for dimensionality reduction, graph partitioning, and data setparameterization,”

IEEE transactions on pattern analysis and machineintelligence , vol. 28, no. 9, pp. 1393–1403, 2006.[18] E. L. Lawler, K. N. Levitt, and J. Turner, “Module clustering to minimizedelay in digital networks,”

IEEE Transactions on Computers , vol. 100,no. 1, pp. 47–57, 1969.[19] R. Rajaraman and D. Wong, “Optimum clustering for delay minimiza-tion,”

IEEE transactions on computer-aided design of integrated circuitsand systems , vol. 14, no. 12, pp. 1490–1495, 1995.[20] S. Hauck and G. Borriello, “An evaluation of bipartitioning techniques,”

IEEE Transactions on Computer-Aided Design of Integrated Circuitsand Systems , vol. 16, no. 8, pp. 849–866, 1997.[21] S.-J. Chen and C.-K. Cheng, “Tutorial on vlsi partitioning,”

VLSI design ,vol. 11, no. 3, pp. 175–218, 2000.[22] N. Katam, A. Shafaei, and M. Pedram, “Design of complex rapidsingle-ﬂux-quantum cells with application to logic synthesis,” in .[23] C. E. Leiserson and J. B. Saxe, “Retiming synchronous circuitry,”

Algorithmica , vol. 6, no. 1, pp. 5–35, 1991.[24] C. Fourie. (2018) SFQ cell library. [Online]. Available: https://github.com/sunmagnetics/RSFQlib[25] B. L. SYNTHESIS, “ABC: A system for sequential synthesis andveriﬁcation,”

Berkeley Logic Synthesis and Veriﬁcation Group , 2011.[26] M. C. Hansen, H. Yalcin, and J. P. Hayes, “Unveiling the iscas-85benchmarks: A case study in reverse engineering,”

IEEE Design & Testof Computers , vol. 16, no. 3, pp. 72–80, 1999.[27] EPFL. (2017) The epﬂ combinational benchmark suite. [Online].Available: https://lsi.epﬂ.ch/benchmarks[28] N. Kito, K. Takagi, and N. Takagi, “A fast wire-routing method andan automatic layout tool for rsfq digital circuits considering wire-lengthmatching,”