Automatic Conversion from Flip-flop to 3-phase Latch-based Designs

Abstract

Latch-based designs have many benefits over their flip-flop based counterparts but have limited use, partially because most RTL specifications are flop-centric and automatic conversion of FF-based to latch-based designs is challenging. Conventional conversion algorithms target master-slave latch-based designs with two non-overlapping clocks. This paper presents a novel automated design flow that converts flip-flop to 3-phase latch-based designs. The resulting circuits have the same performance as the master-slave based designs but require significantly fewer latches. Our experimental results demonstrate the potential for savings in the number of latches (21.3%), area (5.8%), and power (16.3%) on a variety of ISCAS, CEP, and CPU benchmark circuits, compared to the master-slave conversions.


Huimei Cheng, Yichen Gu, and Peter A. Beerel∗
Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California
{huimeich, yichengu, pabeerel}@xyz.edu

1. INTRODUCTION

The growing use of portable/wireless electronic systems and Internet-of-Things (IoT) applications motivates the desire for smaller and more energy-efficient designs in today's very large scale integration (VLSI) circuits. One of two devices, edge-triggered flip-flops (FFs) or level-sensitive latches, is typically used for synchronization and state storage. It is well known that latch-based designs can lead to lower power and area than FF-based designs due to time borrowing, smaller cell area, and lower capacitance [1–3], particularly when process variation is considered [4]. They are also critical for architecturally-agnostic timing resilient designs [5,6], which can remove unnecessary margins associated with PVT variations and make near-threshold computing more practical.

As an intermediate between latch and flip-flop based designs, pulsed-latch schemes have also been proposed [7, 8]. These rely on an edge-triggered pulse generator to provide a short transparency window to all latches. To minimize energy overhead, multi-bit pulsed-latch schemes have been proposed that share pulse generators among several latch cells [9]. Pulsed-latches, however, must be used carefully because they are subject to hold problems and pulse width variations that are challenging to predict, control, and mitigate (see e.g., [10]).

A basic challenge to adopting any form of latch-based design is that most RTL specifications are designed using edge-sensitive FFs. Approaches to automatically converting an FF-based to a latch-based design are thus attractive. Most conversion flows convert the FF-based designs into pulsed-latch designs [11] or two-phase latch-based designs controlled by either master-slave clocks [12] or bundled-data asynchronous controllers [6, 13–16].

Optimization of latch-based designs has also been given some attention in the literature. For example, [2] explores using a mix of master-slave latches and FFs/pulsed-latches. Others take advantage of time borrowing to boost performance and/or reduce area and power consumption [2,12]. Moreover, retiming algorithms for timing-resilient latch-based designs have been developed that consider not only the number of latches required but also the impact of the amount of needed error-detecting logic [17].

Whereas two-phase designs are inherently more robust than pulsed-latch designs, we argue they can be overly restrictive and that multi-phase latch-based designs [18] can sometimes be an attractive alternative.

The key contribution of this paper is to demonstrate that an FF-based design can be automatically converted into a robust multi-phase design with fewer latches than a two-phase design. In particular, we convert an FF-based design to a 3-phase latch-based design using a novel Integer Linear Program (ILP) that minimizes the number of latches, followed by retiming to ensure no performance loss. Our experimental results show an overall average reduction in the number of latches of 21.3% compared to the conventional master-slave designs on ISCAS89 circuits [19], CEP submodules [20], and three CPU designs (i.e., a 3-stage MIPS CPU Plasma [21], a RISC-V Rocket Core [22], and an ARM Cortex-M0 core [23]).

This paper is organized as follows. Section 2 introduces background on multi-phase latch-based designs. Section 3 describes the design constraints we adopt in our conversion algorithm and the area-performance tradeoffs they represent. Section 4 introduces our ILP-based conversion algorithm and Section 5 presents the experimental results based on a broad range of designs. Finally, some conclusions are drawn in Section 6.

∗ This work was partially supported by NSF Grant

2. BACKGROUND

The Sakallah, Mudge, and Olukotun (SMO) model [18] defines an optimal framework for multi-phase latch-based designs. It defines a $k$-phase clock as a collection of $k$ periodic signals with a common cycle time and associated timing constraints, called the General System Timing Constraints (GSTC). The phases $(p_1, p_2, \ldots, p_k)$ are ordered in a global time reference: $e_{i-1} \le e_i$; $e_k = T_c$, where $e_i$ is the closing time of phase $p_i$. $E_{ij}$ is the forward phase shift from phase $p_i$ to phase $p_j$, defined below.

$$E_{ij} = \begin{cases} e_j - e_i, & i < j \\ T_c + e_j - e_i, & i \ge j \end{cases} \qquad (1)$$

Then, the worst-case setup and hold constraints for each phase are defined as follows.

$$\begin{aligned} \text{Hold:}\quad & H_i \le d_j + \delta_j + \delta_{ji} - E_{p_j p_i} \\ \text{Setup:}\quad & T_c - S_i \ge D_j + \Delta_j + \Delta_{ji} - E_{p_j p_i} \end{aligned} \qquad (2)$$

Here, $H_i$ and $S_i$ stand for the hold and setup time of the $i$th latch. The shortest (longest) path delay from the $j$th latch to the $i$th latch is denoted as $\delta_{ji}$ ($\Delta_{ji}$) and the minimal (maximal) delay value of the $j$th latch is $\delta_j$ ($\Delta_j$). $d_j$ ($D_j$) represents the earliest (latest) signal departure time, i.e., the amount of time after the last $e_j$ that the next data starts to propagate through the $j$th latch [18]. $T_c$ denotes the cycle time and we assume all clock phases share the same high pulse width $T_p$ in this paper.
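As an illustration of Equations (1) and (2), the following is a minimal Python sketch that evaluates the forward phase shift and checks the hold and setup inequalities for one latch pair. The function names, argument names, and the example clock (cycle time 3.0 with closing times 1.0, 2.0, 3.0) are assumptions made for this sketch and do not come from [18] or from the paper's tooling.

```python
def forward_phase_shift(e, i, j, t_c):
    """Forward phase shift E_ij of Eq. (1).

    e    -- list of closing times e_1..e_k (e[-1] is expected to equal t_c)
    i, j -- 1-based phase indices
    """
    e_i, e_j = e[i - 1], e[j - 1]
    return e_j - e_i if i < j else t_c + e_j - e_i


def hold_ok(h_i, d_j, delta_j, delta_ji, e_ji):
    """Hold inequality of Eq. (2): H_i <= d_j + delta_j + delta_ji - E_{p_j p_i}."""
    return h_i <= d_j + delta_j + delta_ji - e_ji


def setup_ok(t_c, s_i, dep_j, lat_j, path_ji, e_ji):
    """Setup inequality of Eq. (2): T_c - S_i >= D_j + Delta_j + Delta_ji - E_{p_j p_i}."""
    return t_c - s_i >= dep_j + lat_j + path_ji - e_ji


# Hypothetical 3-phase clock used only for illustration.
closing = [1.0, 2.0, 3.0]
print(forward_phase_shift(closing, 1, 2, 3.0))  # shift from p_1 to p_2
print(forward_phase_shift(closing, 3, 1, 3.0))  # shift that wraps into the next cycle
```

Note that a shift from a phase to itself falls in the $i \ge j$ branch and therefore equals a full cycle time.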

3. LATCH-BASED DESIGNS

This paper's goal is to convert an FF-based design to a latch-based design while minimizing the number of latches, based on a reasonable set of constraints. This section explores the implicit trade-offs associated with these constraints and motivates our three-phase clocking approach.

There are two constraints we adopt that are designed to make the application of latch-based designs easier.

C1: the original position of all FFs must be latched;
C2: neighboring latches, connected by combinational logic, must not be simultaneously transparent.

Constraint C1 is designed to make logical equivalence checking between the latch-based and FF-based designs easier. In particular, we will convert every FF to a latch and only add extra latches where necessary to meet these constraints. During logical equivalence checking the fixed latches can be viewed as FFs and the extra latches can be treated as transparent. Ensuring latches are present at the same position as the original FFs also guarantees the ability to reset the circuit in the same state [24].

Constraint C2 is designed to avoid min-delay problems. In particular, even with min-delay paths equal to 0 ($\delta_i = \delta_{ij} = 0$), the hold constraint is satisfied with zero hold times ($H_i = 0$). This follows because constraint C2 means $E_{p_j p_i} \le T_c - T_p$ and the signal can start to propagate through a latch only after it opens, $d_j \ge T_c - T_p$. This constraint is particularly important when considering an FF with combinational feedback. If no extra latch is added during conversion, the converted circuit would have a single latch $i$ with combinational feedback, which violates C2. This configuration is dangerous because the transparency phase of the latch must be smaller than the minimum delay of the combinational feedback $\delta_{ii}$ to avoid a hold violation. More precisely, the constraint can be formalized as: $\delta_i + \delta_{ii} \ge H_i + T_p$. The key point is that constraint C2 guarantees this configuration is not allowed. In particular, any solution that satisfies this constraint will break such combinational feedback by at least two latches that have non-overlapping clocks.

Figure 1: Converting a linear FF-based pipeline (a) to a 2-phase latch-based pipeline (b) and to a 3-phase latch-based pipeline (c)

A well-known but non-optimal solution to this problem is to convert every FF into two latches, a master and a slave latch, as in [2,13], and retime the slave latches. This master-slave approach satisfies both constraints C1 and C2 but at the cost of doubling the number of sequential elements. That is, before retiming, the extra number of latches added is exactly equal to the number of FFs.

It is interesting to consider the special case of a linear pipeline because it has no FFs with combinational feedback that must be considered. Such a pipeline is illustrated in Figure 1(a), and its cycle time $T_c$ is no shorter than the sum of the FF's clk-to-q delay, the longest data-path delay, and the FF's setup time.

Such linear pipelines can be converted to a latch-based design adding no extra latches, where we clock alternating pipeline stages with alternating phases of a two-phase non-overlapping clock, as illustrated in Figure 1(b). The problem with this solution is that if each combinational logic stage is critical, the time separation between each phase of the clock must be equal to the original cycle time, i.e., $E_{ij} = T_c$, where $T_c$ represents the original cycle time. Letting $T_c^{2P}$ denote the cycle time of the two-phase non-overlapping clocks and assuming $E_{12} = E_{21} = T_c$, Equation 1 implies $e_2 - e_1 = T_c^{2P} + e_1 - e_2 = T_c$ and thus, adding the two expressions, $T_c^{2P} = 2T_c$. In other words, the frequency of the two-phase clocks must be half that of the original FF-based design.

This analysis highlights the fact that there is a trade-off between the number of extra latches added and the performance of the resulting circuit. To avoid this trivial solution in our formulation, we adopt a third constraint:

C3: the converted latch-based design must have the same throughput as the FF-based design assuming the combinational logic is already critical.

We can achieve a latch-based design that meets all constraints C1-C3 in which we add exactly one extra latch stage for every other original pipeline stage using a 3-phase clock, as illustrated in Figure 1(c). Notice that, as desired, this solution has the same throughput as the original pipeline, with phases $p_1$ and $p_3$ opening and closing their respective latches at the rising edge of the FF-based clock. We rely on latch time borrowing to properly capture near-critical combinational paths. The $p_2$ latches inserted between the $p_3$ and $p_1$ latches prevent data latched by $p_3$ from violating the hold times of the subsequent $p_1$ latches.

A natural question to ask is whether 3-phase clocking guarantees optimality in terms of the number of required extra latches. This section proves that it is optimal for linear pipelines but does not guarantee optimality for more general non-linear pipelines.

Theorem I: At least one latch stage has to be inserted between any 3 consecutive stages of a linear pipeline.

Proof by contradiction: Assume there exist three consecutive stages of a linear pipeline for which no extra latch stage is inserted within the combinational logic between stage 1 and stage 2 or between stage 2 and stage 3.

Let time 0 represent the rising edge of the stage 1 clock. According to Constraints C2 and C3, the stage 2 clock can only go high during the time window $(T_p, T_c - T_p)$ and must go low no later than $T_c$.

Case 1: Assume stage 1 data is valid at time 0. Since there is no latch between stage 1 and stage 2, the stage 2 clock captures data no earlier than $T_c$. Then the stage 2 clock should be high during the time period $(T_c - T_p, T_c)$. According to Constraints C2 and C3, stage 3 can only go low during the period $(T_c + T_p, 2T_c - T_p)$. This means that stage 3 has to capture data before time $2T_c - T_p$. Because there is no extra latch inserted between stage 2 and stage 3, stage 3 must capture the data no earlier than $2T_c$. This, however, contradicts the fact that stage 3 must go low before $2T_c - T_p$.

Case 2: Assume the data leaves stage 1 at time t (0

4. CONVERSION ALGORITHM

Our conversion approach is to automatically decompose the FFs into two groups: ones that will be converted to back-to-back connected latches and ones that will be converted into a single latch. The group of FFs converted to a single latch are assigned to clock phase $p_1$. The remaining FFs are converted to latches clocked by either $p_1$ or $p_3$. For this group, an additional latch clocked by $p_2$ is inserted at each latch's output to create a back-to-back configuration. This means that, by construction, there is no direct data path from $p_3$ to $p_1$ latches. Min-delay related hold problems are avoided by allowing an FF to be assigned to phase $p_1$ and converted to a single latch only if none of its fanout FFs are also assigned to $p_1$.

Each FF is treated as a node $u$, and its $FO(u)$ is the set of FFs that can be reached from the FF $u$ via only combinational logic. Every node $u$ has two binary parameters, $G(u)$ and $K(u)$. $G(u)$ decides which group of latches to assign node $u$ to, either the back-to-back latch group ($G(u) = 1$) or the single-latch group ($G(u) = 0$). $K(u)$ determines the node $u$'s clock phase: 1 implies $u$ is clocked by $p_1$ and 0 implies $u$ is clocked by $p_3$. All inserted latches are driven by $p_2$. Our ILP automatically performs this assignment, minimizing the number of back-to-back latches as follows:

$$\begin{aligned}
\text{Minimize}\quad & \sum_{u} G(u) \\
\text{Subject to:}\quad & \forall u \in V: \\
& G(u) = \begin{cases} 1, & K(u) = 0 \\ 1, & K(u) = 1 \wedge \exists v \in FO(u) : K(v) = 1 \\ 0, & \text{otherwise} \end{cases} \\
& K(u) = \begin{cases} 1, & \forall u \in PI \\ \{0, 1\}, & \forall u \in V \end{cases}
\end{aligned} \qquad (3)$$

Here $PI$ stands for the set of all primary input ports and set $V$ contains all nodes in the circuit. To provide consistency to the interface of the design, we assign all primary input ports (PIs) as if they were clocked by $p_1$.

To make the ILP compatible with Gurobi [25], we convert the conditional equations into inequalities:

$$\begin{aligned}
\text{Minimize}\quad & \sum_{u} G(u) \\
\text{Subject to:}\quad & G(u) + K(u) \ge 1 && \forall u \in V \\
& G(u) \ge K(u) + K(v) - 1 && \forall u \in V,\ \forall v \in FO(u) \\
& G(u) \ge K(v) && \forall u \in PI,\ \forall v \in FO(u) \\
& G(u), K(u) \in \{0, 1\}
\end{aligned}$$

The first constraint implies that when $K(u) = 0$, the inequality $G(u) \ge 1$ must hold. The second constraint ensures $G(u) = 1$ if $K(u)$ and any of its fanout $K(v)$ are both 1, rephrasing the second condition in (3). Applying the assumption that all PIs are clocked by $p_1$ to the second constraint above, we obtain the third inequality.

The ILP described above is the core step in a design flow that supports FF-based to 3-phase latch-based design conversion. The first step of our design flow is to run standard synchronous synthesis on the given FF-based RTL design. Here, we take care to enable clock gating to minimize the number of FFs with self-loops, which would otherwise unduly constrain the optimization problem. To be specific, the gated clock, shown in Figure 3(b), is set to be the preferred clock gating style, as compared to enabled clocks illustrated in Figure 3(a).

Figure 3: Enabled (a) to gated clock (b) transformation

Figure 4: Duplicated clock gating logic for phase conversion

Using Python and TCL scripts that interface a leading commercial logic synthesis tool to the Gurobi Integer Linear Program solver [25], we then take the resulting FF-based design, identify the connections between FFs, and formulate the ILP described in Section 4.1. We run the ILP and, using the results, create the equivalent 3-phase latch-based synchronous design by defining the three-phase clocks and connecting them to their associated latches.

For each latch that is clock gated, we trace the clock signal back through the clock gating logic and replace the clock with the assigned clock phase. In the case of latches belonging to the same clock gating register bank but driven by different clock phases, the clock gating logic is duplicated and connected to the two clock phases separately, as shown in Figure 4. We then retime the newly added latches, as described below. The last step in the design flow, left as future work in this paper, is the physical design step, which includes implementation of the three-phase clock trees.
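The inequality form of the assignment problem can be explored on toy graphs without a solver. The sketch below is not the paper's Gurobi-based implementation: it enumerates every $K$ assignment by brute force and derives the smallest $G$ permitted by the inequalities, which is only feasible for a handful of nodes. The function name `assign_phases`, the dictionary-based fanout representation, and the choice to treat primary inputs as ordinary nodes with $K$ fixed to 1 are assumptions made for illustration.

```python
from itertools import product


def assign_phases(fanout, primary_inputs=()):
    """Exhaustively minimise sum(G) under the linearised constraints.

    fanout         -- dict mapping each node u to an iterable FO(u);
                      every fanout node must also be a key of the dict
    primary_inputs -- nodes whose K is fixed to 1 (assumption of this sketch)
    Returns (cost, G, K) for one optimal assignment.
    """
    nodes = sorted(fanout)
    pis = set(primary_inputs)
    best = None
    for bits in product((0, 1), repeat=len(nodes)):
        K = dict(zip(nodes, bits))
        if any(K[u] != 1 for u in pis):
            continue
        G = {}
        for u in nodes:
            g = 1 - K[u]                     # G(u) + K(u) >= 1
            for v in fanout[u]:
                g = max(g, K[u] + K[v] - 1)  # G(u) >= K(u) + K(v) - 1
            G[u] = g
        cost = sum(G.values())
        if best is None or cost < best[0]:
            best = (cost, G, K)
    return best


# Hypothetical 4-stage linear pipeline a -> b -> c -> d.
pipeline = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
cost, G, K = assign_phases(pipeline)
print(cost, G, K)
```

On this pipeline the minimum cost is two back-to-back pairs for four stages, in line with the one-extra-latch-per-two-stages behaviour discussed for linear pipelines, and a single node with a self-loop always costs one extra latch.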

Retiming re-positions the added latches within the combinational logic, minimizing area while satisfying all latch constraints. Unfortunately, many commercial tools have limited support for retiming latches. They do, however, have well-optimized support for the retiming of FFs. Using this fact, [26] proposed to retime latches by mapping the problem to an FF-based retiming problem. Given a synthesized design with clock period $T_c$, they replace each FF with two FFs and retime the entire design with a faster clock constraint of half the original period ($T_c/2$). [...] $p_1$ and $p_3$ are mapped to clk and $p_2$ is tied to clkbar. We then retime the circuit, only allowing FFs tied to clkbar to move. This splits the combinational logic in the pipeline stages that require an extra latch into two, with each part being able to operate at twice the frequency (cycle time $T_c/2$). [...] assignments. Further optimization is then triggered to optimize the sizes of gates in the retimed latch-based design.

Figure 5: 3-phase clocks for modified retiming

5. EXPERIMENTAL RESULTS

This section quantifies the benefits of the proposed conversion algorithm, comparing the resulting 3-phase design to the original FF-based as well as traditional master-slave latch-based designs. The experiments rely on an industrial 28-nm FDSOI CMOS cell library and a range of circuits that include ISCAS89 benchmark circuits [19], CEP submodules [20], and three CPU designs: a 3-stage MIPS Open Core Plasma [21], a RISC-V Rocket Core [22], and an ARM-M0 core [23]. We validated both master-slave and 3-phase latch-based circuits by streaming inputs to the FF-based and latch-based designs and comparing output streams. These gate-level simulations were also used to determine the signal activity used to measure the relative power consumption of our approach. Note, however, that because our results are post-synthesis, our analysis does not consider the power consumption of the clock trees. All experiments were run on two Intel Xeon E5-2450 v2 CPUs with 128GB of RAM.

Note that for a fair comparison, all designs are run at the same frequency and the modified work-around retiming strategy described in Section 4 is also performed on the master-slave latch-based designs.

Table 1 summarizes the number of registers (FFs/latches) in the original FF-based, conventional master-slave latch-based, and 3-phase latch-based designs. The rightmost two columns show the savings of our approach in terms of the number of latches in 3-phase latch-based designs compared to the doubled number of FFs in FF-based designs and the number of latches in master-slave latch-based designs, respectively. The results show that the proposed algorithm reduces the number of latches by an average of 23.4% and 21.3% compared to FF-based and master-slave latch-based designs, respectively. Notice that the 3-phase algorithm has the least overall benefit on the ISCAS89 circuits and, in particular, no benefit on s1488 and s1423. According to [27], s1488 is re-synthesized from a controller, which may suggest that our algorithm brings limited benefits to control-dominated designs that have a predominance of FFs with combinational feedback.

For ISCAS designs we used auto-generated pseudo-random input streams. For CEP and CPU designs, we used the open-source provided testbenches. In particular, Plasma was running the “pi” program, ARM-M0 was running the “hello world” program, RISC-V was running the “rv32ui-v-simple” program, and CEP designs were running the open-source provided self-check programs.

| Design | FF | M-S | 3-phase | Save (%) vs 2*FF | Save (%) vs M-S |
|---|---|---|---|---|---|
| s1196 | 18 | 36 | 26 | 27.8 | 27.8 |
| s1238 | 18 | 36 | 26 | 27.8 | 27.8 |
| s1423 | 74 | 158 | 167 | -12.8 | -5.7 |
| s1488 | 6 | 12 | 12 | 0.0 | 0.0 |
| s5378 | 164 | 326 | 250 | 23.8 | 23.3 |
| s9234 | 145 | 299 | 257 | 11.4 | 14.0 |
| s13207 | 460 | 905 | 761 | 17.3 | 15.9 |
| s15850 | 449 | 922 | 818 | 8.9 | 11.3 |
| s35932 | 1728 | 3456 | 2738 | 20.8 | 20.8 |
| s38417 | 1490 | 2953 | 2466 | 17.2 | 16.5 |
| s38584 | 1268 | 2621 | 2478 | 2.3 | 5.5 |
| ISCAS average | 529.1 | 1065.8 | 909.0 | 14.1 | 14.7 |
| AES | 9703 | 17760 | 13578 | 30.0 | 23.5 |
| DES3 | 425 | 861 | 594 | 30.1 | 31.0 |
| SHA256 | 1554 | 3133 | 2581 | 17.0 | 17.6 |
| MD5 | 782 | 1586 | 1086 | 30.6 | 31.5 |
| CEP average | 3116.0 | 5835.0 | 4459.8 | 28.4 | 23.6 |
| Plasma | 1554 | 3159 | 2150 | 30.8 | 31.9 |
| RISC-V | 2561 | 5226 | 4178 | 18.4 | 20.1 |
| ARM-M0 | 1334 | 2738 | 2185 | 18.1 | 20.2 |
| CPU average | 1816.3 | 3707.7 | 2837.7 | 21.9 | 23.5 |
| Overall average | 1318.5 | 2565.9 | 2019.5 | 23.4 | 21.3 |

Table 1: Number of registers (FFs or latches) in the original flip-flop (FF), converted master-slave latch (M-S), and proposed 3-phase latch-based designs
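The two savings columns of Table 1 appear to follow directly from the register counts: the 3-phase latch count is compared against twice the FF count and against the master-slave latch count. The short Python sketch below recomputes them for a few rows taken from the table; the function name `latch_savings` and the rounding to one decimal place are assumptions made for this illustration.

```python
def latch_savings(ff, ms, three_phase):
    """Return (save vs 2*FF, save vs M-S) in percent, rounded to one decimal."""
    vs_2ff = 100.0 * (2 * ff - three_phase) / (2 * ff)
    vs_ms = 100.0 * (ms - three_phase) / ms
    return round(vs_2ff, 1), round(vs_ms, 1)


# Register counts (FF, M-S, 3-phase) copied from Table 1.
rows = {
    "s1196": (18, 36, 26),
    "s1423": (74, 158, 167),
    "Plasma": (1554, 3159, 2150),
}
for name, counts in rows.items():
    print(name, latch_savings(*counts))
```

A negative value, as for s1423, means the 3-phase design ends up with more latches than the reference.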

Table 2 shows the areas of combinational logic, sequential logic, and the total for each benchmark for FF, master-slave, and 3-phase designs. It also shows the percentage area reductions for the 3-phase designs when compared to both the FF-based and master-slave designs. According to the table, the 3-phase designs achieve an average of 8.4% and 5.8% savings in total area compared to FF-based and master-slave latch-based designs, respectively. Notice that the three CPU benchmarks show a relatively high area reduction over master-slave designs but a relatively low area saving compared to FF-based designs. This is a result of the fact that converting FF-based to latch-based designs sometimes increases the combinational logic area depending on the results of retiming. In particular, for the CPU designs, the average area of combinational logic increases by 10.2% and 3.4% for 3-phase compared to FF-based and master-slave latch-based designs, respectively. On the other hand, the area of the combinational logic changes less in the ISCAS and CEP designs. To be specific, the combinational logic area of ISCAS and CEP 3-phase designs is increased by 3.5% and decreased by 4.6% with respect to FF-based designs, and increased by an average of 1.6% and 2.3% over master-slave latch-based designs, respectively. Note that the degree of logic area increase is clock-frequency dependent, and re-running these experiments at lower frequencies reduces this impact.

| Design | FF Comb | FF Seq | FF Total | M-S Comb | M-S Seq | M-S Total | 3-phase Comb | 3-phase Seq | 3-phase Total | Save (%) wrt FF Comb | Save (%) wrt FF Seq | Save (%) wrt FF Total | Save (%) wrt M-S Comb | Save (%) wrt M-S Seq | Save (%) wrt M-S Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| s1196 | 172.7 | 67.6 | 240.2 | 163.4 | 58.8 | 222.1 | 167.9 | 44.1 | 212.0 | 2.7 | 34.8 | 11.8 | -2.8 | 25.0 | 4.6 |
| s1238 | 168.6 | 67.6 | 236.2 | 161.1 | 58.8 | 219.8 | 162.1 | 44.1 | 206.1 | 3.9 | 34.8 | 12.7 | -0.6 | 25.0 | 6.2 |
| s1423 | 210.5 | 292.8 | 503.3 | 212.2 | 269.6 | 481.8 | 205.1 | 270.4 | 475.6 | 2.6 | 7.6 | 5.5 | 3.3 | -0.3 | 1.3 |
| s1488 | 194.5 | 22.5 | 217.1 | 188.2 | 19.9 | 208.1 | 191.9 | 19.6 | 211.5 | 1.3 | 13.0 | 2.6 | -2.0 | 1.6 | -1.6 |
| s5378 | 396.3 | 615.6 | 1011.8 | 387.3 | 532.0 | 919.3 | 388.1 | 420.7 | 808.8 | 2.1 | 31.7 | 20.1 | -0.2 | 20.9 | 12.0 |
| s9234 | 287.7 | 557.5 | 845.2 | 293.1 | 494.3 | 787.4 | 271.2 | 423.2 | 694.4 | 5.7 | 24.1 | 17.8 | 7.5 | 14.4 | 11.8 |
| s13207 | 551.5 | 1750.2 | 2301.6 | 531.9 | 1486.4 | 2018.3 | 584.4 | 1272.1 | 1856.6 | -6.0 | 27.3 | 19.3 | -9.9 | 14.4 | 8.0 |
| s15850 | 842.0 | 1723.9 | 2565.8 | 923.4 | 1513.7 | 2437.1 | 815.0 | 1353.1 | 2168.1 | 3.2 | 21.5 | 15.5 | 11.7 | 10.6 | 11.0 |
| s35932 | 3087.9 | 6486.2 | 9574.1 | 3046.1 | 5640.2 | 8686.3 | 3056.2 | 4585.6 | 7641.8 | 1.0 | 29.3 | 20.2 | -0.3 | 18.7 | 12.0 |
| s38417 | 2622.6 | 5650.8 | 8273.4 | 2787.8 | 4835.1 | 7622.9 | 3042.2 | 4110.7 | 7152.9 | -16.0 | 27.3 | 13.5 | -9.1 | 15.0 | 6.2 |
| s38584 | 3169.2 | 4905.0 | 8074.2 | 3225.6 | 4307.3 | 7533.0 | 3227.6 | 4051.0 | 7278.6 | -1.8 | 17.4 | 9.9 | -0.1 | 6.0 | 3.4 |
| ISCAS average | 1063.9 | 2012.7 | 3076.6 | 1083.6 | 1746.9 | 2830.6 | 1101.1 | 1508.6 | 2609.7 | -3.5 | 25.0 | 15.2 | -1.6 | 13.6 | 7.8 |
| AES | 102418.8 | 25367.3 | 127786.1 | 94989.6 | 26086.9 | 121076.4 | 97769.5 | 19943.4 | 117712.9 | 4.5 | 21.4 | 7.9 | -2.9 | 23.6 | 2.8 |
| DES3 | 1462.8 | 1127.4 | 2590.2 | 1503.7 | 1266.6 | 2770.3 | 1467.5 | 872.5 | 2340.0 | -0.3 | 22.6 | 9.7 | 2.4 | 31.1 | 15.5 |
| SHA256 | 4383.4 | 4131.2 | 8514.6 | 4495.8 | 4630.3 | 9126.1 | 4285.0 | 3791.0 | 8076.0 | 2.2 | 8.2 | 5.2 | 4.7 | 18.1 | 11.5 |
| MD5 | 4270.8 | 2117.9 | 6378.0 | 3907.0 | 2406.9 | 6313.9 | 3837.3 | 1612.4 | 5449.7 | 10.1 | 23.9 | 14.6 | 1.8 | 33.0 | 13.7 |
| CEP average | 28133.9 | 8186.0 | 36317.2 | 26224.0 | 8597.7 | 34821.7 | 26839.8 | 6554.8 | 33394.6 | 4.6 | 19.9 | 8.0 | -2.3 | 23.8 | 4.1 |
| Plasma | 3998.6 | 4749.9 | 8748.5 | 4192.3 | 4879.2 | 9071.5 | 4781.4 | 3356.4 | 8137.8 | -19.6 | 29.3 | 7.0 | -14.1 | 31.2 | 10.3 |
| RISC-V | 7115.0 | 6860.6 | 13975.6 | 7710.2 | 7765.1 | 15475.3 | 7929.1 | 6138.3 | 14067.4 | -11.4 | 10.5 | -0.7 | -2.8 | 20.9 | 9.1 |
| ARM-M0 | 6458.3 | 3985.2 | 10443.5 | 6821.9 | 4639.4 | 11461.4 | 6655.1 | 3326.2 | 9981.3 | -3.0 | 16.5 | 4.4 | 2.4 | 28.3 | 12.9 |
| CPU average | 5857.3 | 5198.6 | 11055.9 | 6241.5 | 5761.2 | 12002.7 | 6455.2 | 4273.6 | 10728.8 | -10.2 | 17.8 | 3.0 | -3.4 | 25.8 | 10.6 |
| Overall average | 7878.4 | 3915.5 | 11793.3 | 7530.0 | 3938.4 | 11468.4 | 7713.2 | 3090.8 | 10804.0 | 2.1 | 21.1 | 8.4 | -2.4 | 21.5 | 5.8 |

Table 2: Areas (µm²) of flip-flop (FF), master-slave latch (M-S), and 3-phase latch-based designs

Table 3 reports the power dissipation of the resulting designs based on the specific signal activities determined by our back-annotated gate-level simulations. The 3-phase latch-based designs show an average power reduction of 40.8% compared to the FF-based designs and 16.3% compared to the master-slave latch-based designs. The table shows that the proposed approach can save up to 75% of the power consumption at the same frequency when compared to traditional FF-based designs. The improvements over master-slave latch-based designs are more consistent but not as large as those over FF-based designs. In particular, the maximal power reduction is 40%, with an average of 12%, 26%, and 30% benefit over ISCAS, CEP, and CPU master-slave designs, respectively. The overall power savings drop from 41% to 16% when the comparison changes from FF to master-slave designs. This can be explained by the fact that latch-based designs often have less glitching and fewer hold buffers than their FF-based counterparts.

| Design | FF Comb | FF Seq | FF Total | M-S Comb | M-S Seq | M-S Total | 3-phase Comb | 3-phase Seq | 3-phase Total | Total Save (%) FF vs 3-P | Total Save (%) M-S vs 3-P |
|---|---|---|---|---|---|---|---|---|---|---|---|
| s1196 | 0.20 | 0.12 | 0.32 | 0.19 | 0.07 | 0.26 | 0.21 | 0.05 | 0.26 | 20.36 | 1.60 |
| s1238 | 0.20 | 0.12 | 0.32 | 0.20 | 0.07 | 0.27 | 0.21 | 0.05 | 0.26 | 19.07 | 3.83 |
| s1423 | 0.29 | 0.38 | 0.66 | 0.15 | 0.25 | 0.43 | 0.15 | 0.26 | 0.41 | 37.39 | 4.49 |
| s1488 | 0.18 | 0.04 | 0.22 | 0.20 | 0.02 | 0.23 | 0.20 | 0.02 | 0.22 | 0.99 | 2.47 |
| s5378 | 0.37 | 1.00 | 1.37 | 0.35 | 0.56 | 0.91 | 0.40 | 0.48 | 0.88 | 35.61 | 3.78 |
| s9234 | 0.17 | 0.62 | 0.79 | 0.08 | 0.43 | 0.53 | 0.08 | 0.38 | 0.46 | 42.28 | 14.84 |
| s13207 | 0.44 | 2.19 | 2.63 | 0.32 | 1.44 | 1.78 | 0.33 | 1.28 | 1.61 | 38.83 | 9.46 |
| s15850 | 2.28 | 0.44 | 2.72 | 0.56 | 1.25 | 1.88 | 0.44 | 1.39 | 1.83 | 32.91 | 3.09 |
| s35932 | 12.87 | 2.74 | 15.62 | 2.70 | 8.59 | 11.30 | 2.89 | 7.48 | 10.36 | 33.64 | 8.26 |
| s38417 | 6.08 | 2.05 | 8.13 | 2.22 | 3.29 | 5.55 | 2.36 | 2.68 | 5.04 | 37.92 | 9.15 |
| s38584 | 9.56 | 3.31 | 12.87 | 4.63 | 6.23 | 11.14 | 3.67 | 5.00 | 8.67 | 32.63 | 22.18 |
| ISCAS average | 2.97 | 1.18 | 4.15 | 1.06 | 2.02 | 3.12 | 0.99 | 1.73 | 2.73 | 34.28 | 12.52 |
| AES | 0.22 | 12.64 | 12.86 | 0.19 | 4.02 | 4.21 | 0.17 | 3.02 | 3.19 | 75.21 | 24.35 |
| DES3 | 0.48 | 0.31 | 0.79 | 0.47 | 0.27 | 0.74 | 0.40 | 0.16 | 0.56 | 28.41 | 24.04 |
| SHA256 | 0.17 | 0.11 | 0.27 | 0.12 | 0.39 | 0.52 | 0.12 | 0.25 | 0.38 | -37.78 | 27.39 |
| MD5 | 0.29 | 0.05 | 0.34 | 0.27 | 0.19 | 0.47 | 0.16 | 0.12 | 0.28 | 16.50 | 40.15 |
| CEP average | 0.29 | 3.27 | 3.56 | 0.26 | 1.22 | 1.49 | 0.21 | 0.89 | 1.10 | 69.08 | 25.82 |
| Plasma | 0.84 | 0.81 | 1.65 | 1.06 | 0.72 | 1.81 | 0.64 | 0.52 | 1.16 | 29.75 | 35.97 |
| RISC-V | 0.54 | 0.32 | 0.86 | 1.03 | 0.42 | 1.48 | 0.36 | 0.66 | 1.02 | -18.01 | 31.06 |
| ARM-M0 | 1.25 | 0.76 | 2.01 | 0.69 | 1.34 | 2.05 | 1.05 | 0.51 | 1.56 | 22.73 | 23.90 |
| CPU average | 0.88 | 0.63 | 1.51 | 0.93 | 0.83 | 1.78 | 0.68 | 0.56 | 1.25 | 17.54 | 29.98 |
| Overall average | 2.02 | 1.56 | 3.58 | 0.86 | 1.64 | 2.53 | 0.77 | 1.35 | 2.12 | 40.80 | 16.30 |

Table 3: Power consumption (mW) based on simulation in the original flip-flop (FF), converted master-slave latch (M-S), and proposed 3-phase (3-P) latch-based designs

Table 4 reports the power dissipation of the resulting designs using switching-activity based power analysis, assuming a switching activity of 20% on all inputs (except reset and clocks) and registers. It shows similar savings as in the simulation-based power analysis shown in Table 3.

| Design | FF Comb | FF Seq | FF Total | M-S Comb | M-S Seq | M-S Total | 3-phase Comb | 3-phase Seq | 3-phase Total | Total Save (%) FF vs 3-P | Total Save (%) M-S vs 3-P |
|---|---|---|---|---|---|---|---|---|---|---|---|
| s1196 | 0.12 | 0.09 | 0.21 | 0.12 | 0.05 | 0.16 | 0.11 | 0.03 | 0.15 | 29.53 | 10.01 |
| s1238 | 0.11 | 0.09 | 0.21 | 0.12 | 0.05 | 0.16 | 0.11 | 0.04 | 0.15 | 28.73 | 10.77 |
| s1423 | 0.26 | 0.34 | 0.60 | 0.16 | 0.22 | 0.37 | 0.13 | 0.21 | 0.35 | 42.08 | 7.61 |
| s1488 | 0.16 | 0.04 | 0.20 | 0.17 | 0.02 | 0.19 | 0.16 | 0.02 | 0.18 | 8.26 | 3.66 |
| s5378 | 0.48 | 0.83 | 1.32 | 0.46 | 0.44 | 0.89 | 0.47 | 0.37 | 0.84 | 36.47 | 6.30 |
| s9234 | 0.26 | 0.62 | 0.88 | 0.19 | 0.42 | 0.61 | 0.17 | 0.37 | 0.53 | 39.54 | 12.67 |
| s13207 | 0.53 | 2.11 | 2.64 | 0.36 | 1.35 | 1.71 | 0.38 | 1.17 | 1.54 | 41.47 | 9.75 |
| s15850 | 0.85 | 1.86 | 2.71 | 0.62 | 1.37 | 1.99 | 0.54 | 1.20 | 1.74 | 35.66 | 12.37 |
| s35932 | 2.99 | 9.54 | 12.53 | 2.95 | 5.86 | 8.81 | 2.79 | 4.73 | 7.51 | 40.04 | 14.71 |
| s38417 | 2.64 | 4.82 | 7.46 | 3.03 | 4.56 | 7.59 | 2.25 | 3.18 | 5.43 | 27.17 | 28.44 |
| s38584 | 3.38 | 5.04 | 8.42 | 3.07 | 4.55 | 7.62 | 2.22 | 3.52 | 5.74 | 31.91 | 24.76 |
| ISCAS average | 1.07 | 2.31 | 3.38 | 1.02 | 1.72 | 2.74 | 0.85 | 1.35 | 2.20 | 35.00 | 19.78 |
| AES | 24.50 | 18.69 | 43.19 | 24.61 | 11.41 | 36.02 | 21.55 | 7.12 | 28.67 | 33.62 | 20.42 |
| DES3 | 0.78 | 0.74 | 1.53 | 0.68 | 0.40 | 1.08 | 0.64 | 0.24 | 0.89 | 41.84 | 17.44 |
| SHA256 | 2.01 | 2.82 | 4.83 | 1.54 | 1.70 | 3.24 | 1.42 | 1.18 | 2.60 | 46.07 | 19.69 |
| MD5 | 2.26 | 1.33 | 3.59 | 1.97 | 0.88 | 2.85 | 1.33 | 0.49 | 1.82 | 49.30 | 36.18 |
| CEP average | 7.39 | 5.90 | 13.28 | 7.20 | 3.60 | 10.80 | 6.24 | 2.26 | 8.49 | 36.04 | 21.33 |
| Plasma | 1.57 | 1.67 | 3.24 | 1.92 | 2.63 | 4.55 | 1.67 | 1.45 | 3.12 | 3.50 | 31.44 |
| RISC-V | 2.60 | 2.48 | 5.08 | 2.80 | 3.26 | 6.06 | 2.21 | 2.10 | 4.31 | 15.23 | 28.85 |
| ARM-M0 | 1.75 | 1.03 | 2.79 | 2.25 | 1.72 | 3.97 | 1.93 | 0.96 | 2.89 | -3.66 | 27.30 |
| CPU average | 1.98 | 1.73 | 3.70 | 2.32 | 2.54 | 4.86 | 1.94 | 1.50 | 3.44 | 7.07 | 29.24 |
| Overall average | 2.63 | 3.01 | 5.63 | 2.61 | 2.27 | 4.88 | 2.23 | 1.58 | 3.80 | 32.49 | 22.11 |

Table 4: Power consumption (mW) based on switching activity in the original flip-flop (FF), converted master-slave latch (M-S), and proposed 3-phase (3-P) latch-based designs

In summary, our experiments suggest that while significant savings in area and power are possible with our proposed approach, the amount of savings is variable and likely depends on a combination of factors including 1) the percentage of FFs with combinational feedback, which limits the savings in the number of latches, and 2) the impact of retiming latch-based designs on the combinational logic. We should also note that these results are post-synthesis and thus do not reflect the cost of the multiple clock trees nor the savings in hold buffers, both realized during physical design.

The run-time details of the conversion algorithm are reported in Table 5. The column labeled “FF Total” shows the run-times of FF-based synthesis, the next column corresponds to the run-time of master-slave latch-based design conversion, and the last three columns report the run-times spent on solving the ILP, converting and retiming, and the total for 3-phase latch-based designs. Notice that the run-times for most designs, except for AES, are less than 18 minutes, of which at most 29 seconds is consumed by the ILP solver. This suggests that our proposed approach is computationally practical for at least moderately-sized blocks. AES has the largest number of registers (9703 FFs in the original design) and takes the longest time for conversion and retiming, i.e., 1 hr 23 min for master-slave and 3 hrs 12 min for 3-phase.

| Design | FF Total | M-S Total | 3-Phase ILP | 3-Phase Conv | 3-Phase Total |
|---|---|---|---|---|---|
| s1196 | 222 | 396 | 5 | 482 | 487 |
| s1238 | 245 | 385 | 5 | 487 | 492 |
| s1423 | 240 | 425 | 5 | 475 | 480 |
| s1488 | 307 | 395 | 5 | 474 | 479 |
| s5378 | 307 | 448 | 5 | 288 | 293 |
| s9234 | 304 | 422 | 5 | 204 | 209 |
| s13207 | 271 | 470 | 7 | 422 | 429 |
| s15850 | 253 | 248 | 7 | 324 | 331 |
| s35932 | 358 | 667 | 7 | 576 | 583 |
| s38417 | 297 | 638 | 9 | 623 | 632 |
| s38584 | 204 | 656 | 8 | 346 | 354 |
| AES | 1617 | 4974 | 13 | 11524 | 11537 |
| DES3 | 270 | 463 | 5 | 465 | 470 |
| SHA256 | 203 | 400 | 9 | 322 | 331 |
| MD5 | 420 | 495 | 12 | 527 | 539 |
| Plasma | 226 | 549 | 23 | 70 | 93 |
| RISC-V | 412 | 1037 | 17 | 1005 | 1022 |
| ARM-M0 | 379 | 925 | 29 | 485 | 514 |

Table 5: Run-times (sec) of our experiments

6. CONCLUSIONS

This paper presents an algorithm to automatically convert an FF-based design into a 3-phase latch-based design that uses an ILP to minimize the number of required latches. Our experimental synthesis results on a broad range of benchmark circuits show that significant savings are possible in both area and power with practical computational run-times, particularly for pipelined circuits such as multi-stage CPUs, when compared to both FF and master-slave latch-based designs.

Our future work includes quantifying these benefits post place-and-route, including capturing the cost of routing multiple clock trees and the benefits associated with higher tolerance to PVT variations and increased robustness to hold failures. In addition, we plan to quantify the advantage of this approach when applied to timing and soft-error resilient templates in which the decrease in latches also reduces the overhead of the necessary error detection logic.

7. REFERENCES

[1] R. A. Haring, R. Bellofatto, A. A. Bright, P. G. Crumley, M. B. Dombrowa, S. M. Douskey, M. R. Ellavsky, B. Gopalsamy, D. Hoenicke, T. A. Liebsch, J. A. Marcella, and M. Ohmacht, “Blue Gene/L compute chip: Control, test, and bring-up infrastructure,” IBM Journal of Research and Development, vol. 49, no. 2.3, pp. 289–301, March 2005.
[2] K. Singh, H. Jiao, J. Huisken, H. Fatemi, and J. P. De Gyvez, “Low power latch based design with smart retiming,” in Quality Electronic Design (ISQED), International Symposium on. IEEE, 2018, pp. 329–334.
[3] M. Pons, T. Le, C. Arm, D. Séverac, J. Nagel, M. Morgan, and S. Emery, “Sub-threshold latch-based icyflex2 32-bit processor with wide supply range operation,” in , Sept 2016, pp. 33–36.
[4] A. P. Hurst and R. K. Brayton, “The advantages of latch-based design under process variation,” in Proceedings of the IWLS, 2006.
[5] M. Fojtik, D. Fick, Y. Kim, N. Pinckney, D. Harris, D. Blaauw, and D. Sylvester, “Bubble razor: An architecture-independent approach to timing-error detection and correction,” in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International. IEEE, 2012, pp. 488–490.
[6] D. Hand, M. T. Moreira, H.-H. Huang, D. Chen, F. Butzke, Z. Li, M. Gibiluka, M. Breuer, N. L. V. Calazans, and P. A. Beerel, “Blade–a timing violation resilient asynchronous template,” in ASYNC. IEEE, 2015, pp. 21–28.
[7] J.-F. Lin, “Low-power pulse-triggered flip-flop design based on a signal feed-through scheme,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 1, pp. 181–185, 2014.
[8] S. Paik, L.-e. Yu, and Y. Shin, “Statistical time borrowing for pulsed-latch circuit designs,” in Proceedings of the 2010 Asia and South Pacific Design Automation Conference. IEEE Press, 2010, pp. 675–680.
[9] K. Singh, O. A. R. Rosas, H. Jiao, J. Huisken, and J. P. de Gyvez, “Multi-bit pulsed-latch based low power synchronous circuit design,” in Circuits and Systems (ISCAS), 2018 IEEE International Symposium on. IEEE, 2018, pp. 1–5.
[10] Y. Ding, W. Jin, G. He, and W. He, “Short path padding with multiple-Vt cells for wide-pulsed-latch based circuits at ultra-low voltage,” in , Oct 2017, pp. 985–988.
[11] Y. Shin and S. Paik, “Pulsed-latch circuits: A new dimension in asic design,”

IEEE Design & Test ofComputers , vol. 28, no. 6, pp. 50–57, 2011.[12] K. Yoshikawa, Y. Hagihara, K. Kanamaru, Y. Nakamura,S. Inui, and T. Yoshimura, “Timing optimization byreplacing ﬂip-ﬂops to latches,” in

Proceedings of the Asiaand South Paciﬁc Design Automation Conference . IEEEPress, 2004, pp. 186–191.[13] J. Cortadella, A. Kondratyev, L. Lavagno, and C. P.Sotiriou, “Desynchronization: Synthesis of asynchronouscircuits from synchronous speciﬁcations,”

IEEE Trans. onCAD , vol. 25, no. 10, pp. 1904–1921, 2006.[14] A. Branover, R. Kol, and R. Ginosar, “Asynchronous designby conversion: Converting synchronous circuits intoasynchronous ones,” in

Proceedings of the conference onDesign, Automation and Test in Europe-Volume 2 . IEEEComputer Society, 2004, pp. 870–875.[15] A. Saifhashemi, D. Hand, P. A. Beerel, W. Koven, andH. Wang, “Performance and area optimization of abundled-data Intel processor through resynthesis,” in

ASYNC , May 2014, pp. 110–111.[16] Y. Zhang, H. Cheng, D. Chen, H. Fu, S. Agarwal, M. Lin, and P. A. Beerel, “Challenges in building an open-sourceﬂow from RTL to bundled-data design,” in

AsynchronousCircuits and Systems (ASYNC), IEEE InternationalSymposium on , 2018.[17] H. Cheng, H.-L. Wang, M. Zhang, D. Hand, and P. A.Beerel, “Automatic retiming of two-phase latch-basedresilient circuits,”

IEEE Transactions on Computer-AidedDesign of Integrated Circuits and Systems , 2018.[18] K. A. Sakallah, T. N. Mudge, and O. A. Olukotun,“Optimal clocking of synchronous systems,” in

In ACMInternational Workshop on Timing Issues in theSpeciﬁcation and Synthesis of Digital Systems

Proceedings of the1996 IEEE/ACM international conference onComputer-aided design

Closing the gap betweenASIC & custom: tools and techniques for high-performanceASIC design . Springer Science & Business Media, 2002.[27] F. Brglez, D. Bryan, and K. Kozminski, “Combinationalproﬁles of sequential benchmark circuits,” in