Delay Optimization of Combinational Logic by And-Or Path Restructuring

Abstract

We propose a dynamic programming algorithm that constructs delay-optimized circuits for alternating And-Or paths with prescribed input arrival times. Our algorithm fulfills best-known approximation guarantees and empirically outperforms earlier methods by exploring a significantly larger portion of the solution space. Our algorithm is the core of a new timing optimization framework that replaces critical paths of arbitrary length by logically equivalent realizations with less delay. Our framework allows revising early decisions on the logical structure of the netlist in a late step of an industrial physical design flow. Experiments demonstrate the effectiveness of our tool on 7nm real-world instances.



Ulrich Brenner and Anna Hermann

Research Institute for Discrete Mathematics, University of Bonn
{brenner,hermann}@dm.uni-bonn.de

I. INTRODUCTION

In VLSI design, logic synthesis turns the abstract logic specification of a chip into a concrete representation in terms of gates. This happens very early in the design process, and for the following steps, the logical description typically remains fixed. However, during physical design, it may turn out that the chosen implementation of the logic functionality was not the best choice, e.g., with respect to placement or timing. At that point, it would be desirable to find a better-suited logically equivalent representation.

We propose an algorithm that improves timing by logic restructuring of critical combinational paths. Optimizing a path boils down to optimizing an AND-OR path, i.e., a Boolean function of type $t_0 \wedge (t_1 \vee (t_2 \wedge (t_3 \vee (t_4 \wedge (\dots\, t_{m-1}) \dots))))$, see [24].

Besides, AND-OR paths have an important application in the construction of adder circuits. The carry bit computation in an adder (which is the critical part) is equivalent to the evaluation of an AND-OR path. The tasks of AND-OR path and adder optimization are actually equivalent concerning timing if circuit size is disregarded.

Many efficient adder circuits (e.g., [4], [13], [14]) have been proposed in the previous decades and could hence be used for optimizing AND-OR paths. In terms of depth, the best approximation guarantee for AND-OR path circuits has been proven by [8]. However, these approaches optimize circuit depth, yielding fast circuits only if all input signals arrive simultaneously. In our setting on the most timing-critical path, this will rarely be the case. Instead, we minimize circuit delay, a generalization of circuit depth that takes into account individual prescribed input arrival times.

Some algorithms for adder optimization regard input arrival times, but most lack provable guarantees: For adders with general arrival times, there are a greedy heuristic [25] and a dynamic program [16], but for both, no approximation ratio can be shown. In [21], the delay of adders is evaluated regarding arrival times computed after physical design, but the optimization goal is depth and not delay.

Algorithms for AND-OR path optimization with input arrival times that achieve provably good approximation ratios are presented in [3], [20] and [10]. We will explain their ideas in Section II-B. The method of [20] is used in [24] to optimize general logic paths.

Our goal is to restructure critical paths of any length with provably good approximation guarantees. In contrast, many other approaches synthesize whole netlists and thus arbitrary Boolean functions. Since finding a logically equivalent implementation of a given circuit with, say, minimum depth is NP-hard in general, these approaches only replace sub-circuits of constant size by alternative realizations (see e.g., [1], [5], [17], [19], [22]). Here, the new solution is logically correct by construction, but an extension to larger sub-circuits is hardly possible.

Our main contributions are:
• We propose a new dynamic program for delay optimization of AND-OR paths. In fact, the algorithm solves a more general problem, the optimization of so-called extended AND-OR paths. We describe how decisions on the structure of sub-solutions can be postponed until these sub-solutions are combined. This reduces rounding effects that are inherent in previous algorithms.
• Our algorithm fulfills best known theoretical delay guarantees as it is a common generalization of all previously best approaches [3], [10], [20]. Moreover, we demonstrate in experiments that we improve delay significantly compared to those.
• We compute lower bounds on the best possible delay of AND-OR paths. On 89% of our test instances, the result of our algorithm matches the lower bound and is thus provably optimum.
• We propose a framework for timing optimization of combinational paths of arbitrary length based on [24] with our AND-OR path restructuring algorithm as a core routine. The generic delay model used in our core algorithm allows incorporating physical locations. Our framework contains several classical timing-optimization tools and, in contrast to the simple mapping used in [24], an evolved technology-mapping method [6].
• Experiments on recent industrial 7nm chips show the efficiency and effectiveness of our framework. We improve worst slack and total slack considerably without any impact on other metrics.

The rest of the paper is organized as follows. In Section II, we define the AND-OR path optimization problem, survey known approaches, and present our new approximation algorithm. Section III describes our logic restructuring framework. Experimental results are shown in Section IV, and Section V contains concluding remarks.

II. AND-OR PATH OPTIMIZATION

Note that in this section, we use a simplified linear delay model with unit gate delay and zero wire delay. In Section III-A, we will generalize this model to adapt to our application in physical design.

A. Problem formulation

For us, a circuit $C$ is a connected acyclic digraph whose nodes can be partitioned into two sets: inputs with no incoming edges representing Boolean variables, and gates representing an elementary Boolean function (mostly AND2/OR2, i.e., AND and OR gates with fan-in two), where only a single gate $\mathrm{out}(C)$, called the output, has no outgoing edges. An AND-OR path on inputs $t_0, \dots, t_{m-1}$ is a Boolean formula of type

$$g(t_0, \dots, t_{m-1}) = t_0 \wedge (t_1 \vee (t_2 \wedge (t_3 \vee (t_4 \wedge (\dots\, t_{m-1}) \dots))))$$

or

$$g^*(t_0, \dots, t_{m-1}) = t_0 \vee (t_1 \wedge (t_2 \vee (t_3 \wedge (t_4 \vee (\dots\, t_{m-1}) \dots)))).$$

On the left-hand side of Figure 1, a circuit for the AND-OR path $g(t_0, t_1, t_2, t_3, t_4)$ is shown.

Fig. 1: Two circuits computing the function $g(t_0, \dots, t_4) = t_0 \wedge (t_1 \vee (t_2 \wedge (t_3 \vee t_4)))$ with input and gate arrival times.

Given individual arrival times $a(t_i) \in \mathbb{R}$ for each input signal $t_i$, $i = 0, \dots, m-1$, we ask for a Boolean circuit computing $g(t_0, \dots, t_{m-1})$ that consists of AND2/OR2 gates only and has the smallest possible delay. Each gate is assumed to take one time unit, so the arrival time at a gate is the maximum of its predecessors' arrival times plus 1. By scanning a circuit $C$ from the inputs to the output, we can compute arrival times at all gates. The delay of a circuit is defined as the arrival time at $\mathrm{out}(C)$. Summarizing, we study the following problem:

AND-OR PATH OPTIMIZATION
Instance: $m \in \mathbb{N}$, Boolean input variables $t = (t_0, \dots, t_{m-1})$, arrival times $a(t_0), \dots, a(t_{m-1}) \in \mathbb{R}$.
Task: Compute a circuit $C$ using only AND2/OR2 gates that realizes $g(t)$ or $g^*(t)$ with minimum possible delay.

Figure 1 shows how gate arrival times are computed in two circuits that both realize the AND-OR path $g(t_0, t_1, t_2, t_3, t_4)$. The circuits have a delay of 7 and 6, respectively. Note that in the special case when all input arrival times are 0, circuit delay is exactly circuit depth, i.e., the length of a longest directed path.

Given an instance consisting of inputs $t_0, \dots, t_{m-1}$ with arrival times $a(t_0), \dots, a(t_{m-1})$, we define the weight $W := \sum_{i=0}^{m-1} 2^{a(t_i)}$. It is not too difficult to see that $\lceil \log_2(W) \rceil$ is a lower bound for the delay of any binary circuit computing an AND-OR path for inputs $t_0, \dots, t_{m-1}$ with arrival times $a(t_0), \dots, a(t_{m-1}) \in \mathbb{N}$ (this boils down to Kraft's inequality [15]; see [20] for a concise proof).

Optimizing $g(t)$ and $g^*(t)$ is equally hard: By the duality principle of Boolean algebra, any circuit for $g(t)$ consisting of AND and OR gates can be transformed into a circuit for $g^*(t)$ with the same delay by exchanging AND and OR gates, and vice versa.

B. Previous Algorithms
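As a reference point for the algorithms surveyed in this subsection, the delay model and the lower bound $\lceil \log_2(W) \rceil$ from Section II-A can be checked by hand on small circuits. The following minimal Python sketch does this for the naive chain realization of $g(t_0, \dots, t_4)$. The circuit encoding, the function names, and the arrival times are illustrative assumptions; in particular, the arrival times are not those of Figure 1.

```python
def gate_arrival_times(inputs, gates):
    """Propagate arrival times under the unit-gate-delay model.

    inputs: dict mapping input name -> arrival time (hypothetical encoding)
    gates:  list of (gate_name, pred_a, pred_b) in topological order
    """
    arrival = dict(inputs)
    for name, pred_a, pred_b in gates:
        # A gate is ready one time unit after its latest predecessor.
        arrival[name] = max(arrival[pred_a], arrival[pred_b]) + 1
    return arrival


def kraft_lower_bound(arrival_times):
    """ceil(log2(W)) with W = sum of 2^a(t_i), for non-negative integer arrival times."""
    weight = sum(2 ** a for a in arrival_times)
    return (weight - 1).bit_length()  # exact integer ceil(log2(weight))


# Naive chain for g(t0..t4) = t0 AND (t1 OR (t2 AND (t3 OR t4))).
chain = [
    ("or34", "t3", "t4"),
    ("and2", "t2", "or34"),
    ("or1", "t1", "and2"),
    ("out", "t0", "or1"),
]

# Assumed arrival times: one late input close to the output.
late_first = {"t0": 3, "t1": 0, "t2": 1, "t3": 0, "t4": 0}
# All inputs arrive at time 0, so delay equals depth.
all_zero = {name: 0 for name in late_first}

delay_late = gate_arrival_times(late_first, chain)["out"]
delay_zero = gate_arrival_times(all_zero, chain)["out"]
bound_late = kraft_lower_bound(late_first.values())
bound_zero = kraft_lower_bound(all_zero.values())

print("late t0 :", delay_late, "lower bound", bound_late)
print("all zero:", delay_zero, "lower bound", bound_zero)
```

With these assumed inputs, the chain already meets the lower bound when the late signal enters next to the output, while for simultaneous arrivals it stays one gate above the bound, which is the gap that restructuring tries to close.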

A common approach for AND-OR path optimization is the application of recursion formulas that allow reducing the problem to the construction of circuits for AND-OR paths with fewer inputs.

The algorithm by Rautenbach et al. [20] is based on the following equation (for $\lambda \in \mathbb{N}$ with $2\lambda < m - 1$):

$$g(t_0, \dots, t_{m-1}) = g(t_0, \dots, t_{2\lambda-1}) \vee \big(t_0 \wedge t_2 \wedge t_4 \wedge \dots \wedge t_{2\lambda-2} \wedge g(t_{2\lambda}, \dots, t_{m-1})\big) \tag{1}$$

To see the correctness of (1), check that $g(t_0, \dots, t_{m-1})$ is true exactly in the following two cases:
• $g(t_0, \dots, t_{2\lambda-1})$ is true (then the other inputs do not matter).
• $g(t_{2\lambda}, \dots, t_{m-1})$ is true and the value "true" is propagated to the output because the inputs $t_0, t_2, t_4, \dots, t_{2\lambda-2}$ are all true.

See [20] for a detailed proof. Using formula (1), an AND-OR path circuit on inputs $t_0, \dots, t_{m-1}$ can be constructed by combining AND-OR path circuits on inputs $t_0, \dots, t_{2\lambda-1}$ and on inputs $t_{2\lambda}, \dots, t_{m-1}$ and a circuit for a multi-input AND on the inputs $t_0, t_2, \dots, t_{2\lambda-2}$. Using (1) in a dynamic program with running time $O(m)$, the authors of [20] construct AND-OR path circuits with delay at most $1.441 \log_2(W) + 3$. Held and Spirkl [10] obtain a slightly better delay bound of $1.441 \log_2(W) + 2$ using the dual of the following equation (for $\lambda$ with $2\lambda < m - 2$):

$$g(t_0, \dots, t_{m-1}) = g(t_0, \dots, t_{2\lambda}) \wedge \big((t_1 \vee t_3 \vee \dots \vee t_{2\lambda+1}) \vee g(t_{2\lambda+2}, \dots, t_{m-1})\big) \tag{2}$$

Their algorithm runs in time $O(m \log m)$ as they explicitly choose $\lambda$ in each recursion step. The proof of (2) is analogous to the proof of (1), but here one should check in which cases the two formulas are false. We will use (2) in a slightly different equivalent form (note that $t_{2\lambda+1} \vee g(t_{2\lambda+2}, \dots, t_{m-1}) = g^*(t_{2\lambda+1}, \dots, t_{m-1})$):

$$g(t_0, \dots, t_{m-1}) = g(t_0, \dots, t_{2\lambda}) \wedge \big((t_1 \vee t_3 \vee \dots \vee t_{2\lambda-1}) \vee g^*(t_{2\lambda+1}, \dots, t_{m-1})\big) \tag{3}$$

As (1) and (3) contain functions combining a multi-input AND or OR with an AND-OR path, we define for $t = (t_0, \dots, t_{m-1})$ and $0 \le i \le j \le k < m$ with $j - i$ even the extended AND-OR paths

$$\varphi_{i,j,k} := t_i \wedge t_{i+2} \wedge \dots \wedge t_{j-4} \wedge t_{j-2} \wedge g(t_j, \dots, t_k)$$

and

$$\varphi^*_{i,j,k} := t_i \vee t_{i+2} \vee \dots \vee t_{j-4} \vee t_{j-2} \vee g^*(t_j, \dots, t_k).$$

The extended AND-OR path $\varphi_{0,4,12}$ is depicted in Figure 2(a). From the splits (1) and (3), using extended AND-OR paths as a more flexible replacement for sub-functions, we deduce the splits

$$\varphi_{0,0,m-1} = \varphi_{0,0,2\lambda-1} \vee \varphi_{0,2\lambda,m-1} \quad \text{for } 1 \le \lambda \le \tfrac{m-1}{2}, \tag{4}$$

$$\varphi_{0,0,m-1} = \varphi_{0,0,2\lambda} \wedge \varphi^*_{1,2\lambda+1,m-1} \quad \text{for } 0 \le \lambda \le \tfrac{m-2}{2}, \tag{5}$$

that can be generalized to extended AND-OR paths as in

$$\varphi_{i,j,k} = \varphi_{i,j,j+2\lambda-1} \vee \varphi_{i,j+2\lambda,k} \quad \text{for } 1 \le \lambda \le \tfrac{k-j}{2}, \tag{6}$$

$$\varphi_{i,j,k} = \varphi_{i,j,j+2\lambda} \wedge \varphi^*_{j+1,j+2\lambda+1,k} \quad \text{for } 0 \le \lambda \le \tfrac{k-j-1}{2}. \tag{7}$$

Note that in (6) and (7), the functions on the right-hand side depend on fewer inputs than $\varphi_{i,j,k}$.
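The splits (1) and (3) are Boolean identities, so for small $m$ they can be verified exhaustively over all truth assignments. The following Python sketch does exactly that; the function names are illustrative, and the loops simply range over every $\lambda$ for which both sub-paths on the right-hand side are non-empty.

```python
from itertools import product


def g(t):
    """AND-OR path t0 AND (t1 OR (t2 AND ...)) on a tuple of booleans."""
    return t[0] if len(t) == 1 else (t[0] and g_star(t[1:]))


def g_star(t):
    """Dual path t0 OR (t1 AND (t2 OR ...))."""
    return t[0] if len(t) == 1 else (t[0] or g(t[1:]))


def split1(t, lam):
    """Right-hand side of split (1)."""
    even_prefix = t[0:2 * lam:2]  # t0, t2, ..., t_{2*lam-2}
    return g(t[:2 * lam]) or (all(even_prefix) and g(t[2 * lam:]))


def split3(t, lam):
    """Right-hand side of split (3)."""
    odd_prefix = t[1:2 * lam:2]  # t1, t3, ..., t_{2*lam-1}
    return g(t[:2 * lam + 1]) and (any(odd_prefix) or g_star(t[2 * lam + 1:]))


def splits_hold(m):
    """Check both splits for all 2^m assignments and all admissible lam."""
    for t in product([False, True], repeat=m):
        for lam in range(1, (m - 1) // 2 + 1):   # 2*lam <= m-1
            if split1(t, lam) != g(t):
                return False
        for lam in range(0, (m - 2) // 2 + 1):   # 2*lam + 1 <= m-1
            if split3(t, lam) != g(t):
                return False
    return True


print(all(splits_hold(m) for m in range(1, 11)))
```

An exhaustive check of this kind is only a sanity test for small input counts; the proofs referenced above cover all $m$.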
Figure 1 shows an example for split (5) with $\lambda = 1$, and Figures 2(a) and 2(b) for split (7) with $\lambda = 2$.

Using split (7) and its dual, Grinchuk [8] proves the upper bound $\log_2 m + \log_2 \log_2 m + 3$ on the depth of AND-OR path circuits, and Brenner and Hermann [3] give an algorithm for arbitrary integer arrival times with running time $O(m \log m)$ and a delay bound of

$$\log_2 W + \log_2 \log_2 m + \log_2 \log_2 \log_2 m + 4.3. \tag{8}$$

In the special case when $k - j \le 1$, $\varphi_{i,j,k}$ is actually a multi-input AND, and the function can be realized by a delay-optimum circuit using a greedy algorithm called Huffman coding:

Theorem 1 (Golumbic [7], based on Huffman [11]). Given inputs $t_0, \dots, t_{m-1}$ with arrival times $a(t_i)$, a delay-optimum circuit for the Boolean function $t_0 \wedge \dots \wedge t_{m-1}$ (or $t_0 \vee \dots \vee t_{m-1}$) can be constructed in $O(m \log m)$ time. If $a(t_i) \in \mathbb{N}$ for all $i = 0, \dots, m-1$, then the delay of an optimum circuit is $\lceil \log_2(W) \rceil$.

C. Our Approach

We present an algorithm for AND-OR path optimization with prescribed input arrival times that generalizes any of the algorithms in [3], [10], [20]. In particular, on any instance, the delay of our solution is at least as good as the delay computed with any of the three algorithms, and on most instances, it is better, cf. Section IV.

To simplify notation, hereafter we assume that all arrival times are integral. Still, our implementation allows arbitrary arrival times.

Recall that in the AND-OR path optimization problem, we aim at computing a circuit containing only fan-in-2 gates. However, in intermediate steps, we allow a larger fan-in for the gate computing the output of the circuit. This leads to the following definition.

Fig. 2: A possible way to construct circuit $C_{i,j,k}$ realizing $\varphi_{i,j,k}$ in Algorithm 2 with $i = 0$, $j = 4$, $k = 12$ and split (7) with $\lambda = 2$. In this example, we use naive implementations for $C_1$ and $C_2$. (a) A simple circuit realizing $f$. (b) Illustration of split (7) on $f$. (c) Output of Algorithm 1 on Figure 2(b).

Definition 2. An undetermined circuit is a Boolean circuit $C$ consisting of AND and OR gates only such that all gates with the possible exception of $\mathrm{out}(C)$ have fan-in two. With given input arrival times, the weight of $C$ is $\mathrm{weight}(C) := \sum_{i=1}^{k} 2^{d_i}$, where $d_1, \dots, d_k$ are the arrival times at the predecessors of $\mathrm{out}(C)$.

In Figure 1, the weight of the left and right undetermined circuit is $2^6 + 2^2 = 68$ and $2^5 + 2^4 = 48$, respectively. Figure 2(c) displays an undetermined circuit with fan-in […] at the output gate.

For undetermined circuits, we do not yet specify how we realize the output gate by fan-in-2 gates. This allows greater flexibility when combining several such circuits to a larger circuit. The following lemma shows that optimizing the weight of an undetermined circuit can be used to compute fan-in-2 circuits with small delay.

Lemma 3.

Given an undetermined circuit $C$, we can construct a Boolean circuit using AND2/OR2 gates only that realizes the same function as $C$ with delay at most $\lceil \log_2(\mathrm{weight}(C)) \rceil$.

Proof. Apply Huffman coding with the predecessors of $\mathrm{out}(C)$ as inputs (see Theorem 1).

Algorithm 2 states our overall dynamic programming algorithm for AND-OR path optimization on inputs $t_0, \dots, t_{m-1}$, which works as follows: We compute a cubic-size table that contains undetermined circuits $A_{i,j,k}$ and $O_{i,j,k}$ realizing the extended AND-OR path $\varphi_{i,j,k}$ for all $0 \le i \le j \le k \le m-1$ with $j - i$ even, where $\mathrm{out}(A_{i,j,k})$ is an AND gate and $\mathrm{out}(O_{i,j,k})$ is an OR gate. In particular, this computes circuits for the entire AND-OR path $\varphi_{0,0,m-1} = g(t_0, \dots, t_{m-1})$.

Note that when $k = j$ or $k = j + 1$, the function $\varphi_{i,j,k}$ is a multiple-input AND; hence, in Line 4, an optimum solution can be found by Huffman coding (see Theorem 1). To compute undetermined circuits for $\varphi_{i,j,k}$ with $j - i$ even and $k > j + 1$, we assume that we have already computed undetermined circuits for all instances with fewer inputs. Then, in Line 6, we can enumerate all possible choices of $\lambda$ in the splits (6) and (7) to recursively compute a circuit $C$ for $\varphi_{i,j,k}$ from pre-computed solutions (while dualizing one sub-circuit accordingly in split (7)). Since the combination of two undetermined circuits is not necessarily an undetermined circuit, we apply Algorithm 1. Here, in Line 7, we fix the structure of the undetermined sub-circuit $C_i$ as a circuit $C'_i$ over {AND2, OR2}. Figure 2 shows an example of split (7). In Algorithm 2, the circuit $C$ is stored in a candidate list $\mathcal{C}$ of undetermined circuits for $\varphi_{i,j,k}$. The undetermined circuits in $\mathcal{C}$ with the best weight among those with an AND or OR gate at the output are stored as $A_{i,j,k}$ in Line 7 and $O_{i,j,k}$ in Line 8, respectively.

As the final circuit for $\varphi_{0,0,m-1}$, we choose the weight-minimum circuit among $A_{0,0,m-1}$ and $O_{0,0,m-1}$ in Line 9, turned into a circuit over {AND2, OR2} by Lemma 3.

Algorithm 1: Merging undetermined circuits.
Input: Undetermined circuits $C_1$ and $C_2$ computing Boolean functions $h_1$ and $h_2$; a gate type $\circ \in \{\text{AND}, \text{OR}\}$.
Output: An undetermined circuit $C$ computing $h_1 \circ h_2$.
 1  Add a $\circ$ gate $c$ to the union of the circuits $C_1$ and $C_2$.
 2  for $i \leftarrow 1$ to $2$ do
 3      Let $c_1, \dots, c_k$ be the predecessors of $\mathrm{out}(C_i)$.
 4      if $\mathrm{out}(C_i)$ is a $\circ$ gate then
 5          Remove $\mathrm{out}(C_i)$ and add edges $(c_1, c), \dots, (c_k, c)$.
 6      else
 7          Use Lemma 3 to construct a circuit $C'_i$ from $C_i$.
 8          Add an edge from $\mathrm{out}(C'_i)$ to $c$.

Algorithm 2: AND-OR path optimization.
Input: Boolean variables $t_0, \dots, t_{m-1}$ with arrival times $a(t_0), \dots, a(t_{m-1}) \in \mathbb{N}$.
Output: A Boolean circuit computing $g(t_0, \dots, t_{m-1})$.
 1  for $l \leftarrow 1$ to $m$ do
 2      for $0 \le i \le j \le k < m$, $j - i$ even, s.t. $\varphi_{i,j,k}$ has $l$ inputs do
 3          if $k \in \{j, j+1\}$ then   // $\varphi_{i,j,k}$ is a multi-input AND
 4              $A_{i,j,k}$ := circuit computed by Huffman coding.
 5          else
 6              $\mathcal{C}$ := list of undetermined circuits for $\varphi_{i,j,k}$ arising from applying split (6) or (7) with any valid $\lambda$, followed by a call to Algorithm 1.
 7              $A_{i,j,k}$ := argmin $\{W(C) : C \in \mathcal{C}, \mathrm{out}(C) = \text{AND}\}$.
 8              $O_{i,j,k}$ := argmin $\{W(C) : C \in \mathcal{C}, \mathrm{out}(C) = \text{OR}\}$.
 9  $C$ := argmin $\{W(A_{0,0,m-1}), W(O_{0,0,m-1})\}$.
10  return Circuit $C'$ resulting from applying Lemma 3 to $C$.

Theorem 4.

Algorithm 2 computes a circuit with delay at most $\log_2(W) + \log_2 \log_2(m) + \log_2 \log_2 \log_2(m) + 4.3$ and can be implemented to run in time $O(m^4)$.

Proof (sketch). Algorithm 2 considers, in particular, all recursion steps from [3]. Using this, one can show that for any sub-instance $\varphi_{i,j,k}$, the algorithm computes a solution which is at least as good as the solution computed by the algorithm from [3] and thus also meets the delay bound (8). The running time is dominated by $O(m^4)$ calls to Algorithm 1 (a cubic number of table entries with $O(m)$ choices of $\lambda$ each), each of which can be implemented to run in constant time if only weights and delays are computed and only the final circuit $C'$ in Line 10 of Algorithm 2 is actually constructed.
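The weights-only variant mentioned in the proof sketch can be illustrated as follows. This is a minimal Python sketch in the spirit of Algorithm 2 under several assumptions that are not taken from the paper: arrival times are non-negative integers, the table is filled by memoized recursion instead of a loop over the number of inputs, the base case is stored as one flat multi-input AND gate (a valid undetermined circuit) rather than a Huffman tree, and no circuit is actually constructed. All function names are hypothetical.

```python
from functools import lru_cache
from math import inf


def ceil_log2(w):
    """Exact ceil(log2(w)) for a positive integer w."""
    return (w - 1).bit_length()


def chain_delay(a):
    """Delay of the naive chain t0 AND (t1 OR (...)), for comparison."""
    d = a[-1]
    for x in reversed(a[:-1]):
        d = max(x, d) + 1
    return d


def and_or_path_delay(a):
    """Delay reached by the weights-only recursion over splits (6) and (7)."""
    m = len(a)

    def fixed(w):
        # Weight contributed after fixing the output gate via Lemma 3.
        return inf if w == inf else 2 ** ceil_log2(w)

    @lru_cache(maxsize=None)
    def best(i, j, k):
        """(A, O): best weights for phi_{i,j,k} with AND / OR output gate."""
        if k <= j + 1:  # phi_{i,j,k} is a multi-input AND
            idx = list(range(i, j, 2)) + list(range(j, k + 1))
            w = sum(2 ** a[t] for t in idx)
            # A single input has no gate and fits under either gate type.
            return (w, w) if len(idx) == 1 else (w, inf)
        A = O = inf
        for lam in range(1, (k - j) // 2 + 1):          # split (6), OR output
            left = best(i, j, j + 2 * lam - 1)
            right = best(i, j + 2 * lam, k)
            O = min(O, min(left[1], fixed(left[0]))
                    + min(right[1], fixed(right[0])))
        for lam in range(0, (k - j - 1) // 2 + 1):      # split (7), AND output
            left = best(i, j, j + 2 * lam)
            # Table entry of the undualized path; dualizing swaps AND and OR.
            right = best(j + 1, j + 2 * lam + 1, k)
            A = min(A, min(left[0], fixed(left[1]))
                    + min(right[1], fixed(right[0])))
        return (A, O)

    return ceil_log2(min(best(0, 0, m - 1)))


for arrivals in ([0] * 5, [0, 1, 2, 3, 4], [2] * 8):
    print(arrivals, "chain:", chain_delay(arrivals),
          "restructured:", and_or_path_delay(arrivals))
```

Merging a sub-circuit whose output gate matches the new gate type contributes its weight unchanged, whereas a mismatching sub-circuit is first rounded up to a power of two; postponing that rounding is where the gain over fixing every sub-circuit immediately comes from.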


Fig. 3: Flow chart for our logic optimization framework (cf. Section III) with the path restructuring step in green.

We conjecture that a stronger theoretical delay bound can be proven for our algorithm.

In Section IV, we will see that in our practically applied logic optimization framework, the running time of Algorithm 2 is negligible.

In order to take care of the circuit size, we can modify Algorithm 2 as follows: For each sub-instance $\varphi_{i,j,k}$, we store not just one circuit with the best delay per output gate type, but all non-dominated circuits. Here, circuit $C$ dominates circuit $C'$ if both weight and size of $C$ are at least as good as in $C'$ and if the gate types of $\mathrm{out}(C)$ and $\mathrm{out}(C')$ coincide. In the end, we choose $C$ to be the smallest among all weight-optimum circuits. This does not affect the delay of the circuit (and Theorem 4 still holds), but often reduces its size.

III. LOGIC OPTIMIZATION FRAMEWORK

We propose a timing optimization framework (cf. Figure 3), based on Werber et al. [24] and with Algorithm 2 as an essential component, that is used in production in a late pre-routing stage of an industrial physical design flow. Our framework revises the logical structure of critical paths using placement and timing information. In Section III-A, we adapt the delay model used in Algorithm 2 to respect placement, buffering and gate sizing effects. As we do not fully account for different kinds of gates or different gate sizes that might be available, our framework involves a technology mapping step (Section III-B) and powerful gate sizing and buffering routines (Section III-C).

We iteratively optimize the worst slack of the currently most timing-critical combinational path until overall worst slack does not improve significantly anymore. A single iteration works as follows: Let $P$ denote a most critical path. During a preoptimization step, we first try to improve the slack of $P$ without changing its logical structure in order to diminish disruptions. To this end, we apply detailed optimization to $P$ as described in Section III-C. If a threshold slack improvement of $\delta_{min}$ is exceeded, we keep the changes imposed by preoptimization and start the next iteration.

Otherwise, we discard the preoptimization's changes and perform the path restructuring step (central, green part of Figure 3). This step works on internal data structures; the netlist is not changed before detailed optimization (Section III-C). We consider the possibility to optimize any sub-path $S$ of $P$ up to a maximum length of $m_{max}$. First, we apply a normalization (Section III-A) in order to extract an AND-OR path $S'$ from $S$ on which we run Algorithm 2. Then, the technology mapping routine from [6] (see also Section III-B) locally modifies $S$ to benefit from all available gate types. After having optimized all sub-paths of $P$, we store all restructuring possibilities in a list $L$, sorted by decreasing estimated slack gain.

We apply the time-consuming detailed optimization (cf. Section III-C) only to the most promising fraction of restructuring options. First, we tentatively apply detailed optimization to the topmost $k$ candidates in $L$. If the actual slack gain of the best solution exceeds $\delta_{target}$, we choose this solution; otherwise, we iteratively decrease $\delta_{target}$ by a fixed value and try out the next $k$ candidates in $L$ until the best slack gain seen so far reaches $\delta_{target}$ or $L$ is empty. Afterwards, we choose the restructuring candidate $C$ with the best actual slack gain $\delta_C$ for $P$ among all detailed-optimized solutions. This way, we usually apply detailed optimization to only a few instances, but still find the overall best restructuring option. If $\delta_C \ge \delta_{min}$ and if no side path slack has worsened beyond the initial slack of $P$, we implement this netlist change, possibly retaining parts of $P$ needed for side outputs. If the change is implemented and the slack gain over the last $num_{it}$ iterations exceeds a threshold $\delta_{it}$, we start the next iteration; otherwise, we stop.

Note that this is a simplified flow description. E.g., in practice, we optimize the second critical path or the most critical latch-to-latch path when $P$ cannot be further optimized.

A. Normalization

Our AND-OR path optimization algorithm from Section II expects as an input an alternating path of AND2/OR2 gates. However, $P$ contains arbitrary gates with varying delays, and the physical locations of the path inputs might be far apart, inducing undeniably high wire delays even after buffering. A normalization step thus transforms $P$ into a piece of netlist whose core part is an AND-OR path with appropriately modified input arrival times.

As we work on the most critical path, the buffering routine applied in Section III-C will compute delay-optimum solutions. Thus, we can assume a linear wire delay and estimate the wire delay between two physical positions $p_1$ and $p_2$ by $d_{dist} \cdot \|p_1 - p_2\|$ for a constant $d_{dist} \in \mathbb{R}$. The traversal time through a gate is approximated by a constant $d_{gate} \in \mathbb{R}$. The constants $d_{gate}$ and $d_{dist}$ are chosen based on an analysis of typical values on the respective design. Since fan-outs and slews on the critical path are rather low, the delay of gates with different types and sizes still varies, but not much in comparison to the differences in arrival times. Hence, assuming a realistic constant gate delay suffices to determine the logical structure of the circuit.

Fig. 4: A subpath $S$ of the critical path $P$ before (left) and after normalization (right). On the right, the extracted AND-OR path $S'$ is colored. Critical wires are drawn in red.

Since we work on the most timing-critical part of the design, we place the circuit $C$ computed by Algorithm 2 such that each path is embedded delay-optimally, implying that each path from an input $t_i$ to $\mathrm{out}(C)$ has a wire delay of $d_{dist} \cdot \|l(t_i) - l(\mathrm{out}(C))\|$, where $l$ indicates physical coordinates on the chip. Thus, the delay of $C$ is

$$\max_{Q \colon t_i \rightsquigarrow \mathrm{out}(C)} \big\{ a(t_i) + d_{dist} \cdot \|l(t_i) - l(\mathrm{out}(C))\| + d_{gate} \cdot |Q| \big\},$$

where the maximum ranges over all paths $Q$ in $C$ from any input $t_i$ to $\mathrm{out}(C)$.
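The delay formula above is a plain maximum over inputs once the longest path length per input is known, and dividing every term by $d_{gate}$ turns it into an instance of the unit-gate-delay model of Section II. The Python sketch below illustrates this with made-up numbers. The use of the Manhattan distance, the constants, the coordinates and the function names are all assumptions made for illustration only.

```python
def manhattan(p, q):
    """Assumed distance measure between two placement coordinates."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def physical_delay(inputs, out_pos, d_gate, d_dist):
    """Delay under the linear model.

    inputs: list of (arrival_time, position, gates_on_longest_path_to_output)
    """
    return max(a + d_dist * manhattan(pos, out_pos) + d_gate * depth
               for a, pos, depth in inputs)


def unit_model_arrival(a, pos, out_pos, d_gate, d_dist):
    """Arrival time rescaled so that one gate costs exactly one time unit."""
    return (a + d_dist * manhattan(pos, out_pos)) / d_gate


# Hypothetical values: gate delay 8, wire delay 0.5 per unit of distance.
d_gate, d_dist = 8.0, 0.5
out_pos = (6, 6)
inputs = [(20.0, (0, 0), 3), (35.0, (4, 8), 1)]

delay_phys = physical_delay(inputs, out_pos, d_gate, d_dist)
delay_unit = max(unit_model_arrival(a, pos, out_pos, d_gate, d_dist) + depth
                 for a, pos, depth in inputs)

print(delay_phys, d_gate * delay_unit)
```

Both printed values agree, which is why a restructuring algorithm written for unit gate delays can be reused once the arrival times absorb the wire delay to the output location.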
Applying Algorithm 2 with modified arrival times

$$a'(t_i) := \frac{1}{d_{gate}} \Big( a(t_i) + d_{dist} \cdot \|l(t_i) - l(\mathrm{out}(C))\| \Big)$$

hence yields a circuit with optimum wire delay with respect to physical locations. In fact, we choose a placement that is netlength-optimum among all delay-optimum placements: We determine $l(\mathrm{out}(C))$ based on its successors in the netlist and place each gate at the median position of its predecessors and $\mathrm{out}(C)$.

Now, we can describe our normalization. Let $x$ denote the most critical input of a sub-path $S$ of $P$. We represent each gate in $S$ using AND and INV gates only. This does not necessarily yield a path, but we can recover the original critical path by following the signal flow of $x$, obtaining a path $S'$. By applying De Morgan transformations in reverse topological order, we ensure that $S'$ contains AND and OR gates only, possibly adding inverters at the inputs of $S'$. We use Huffman coding (Theorem 1) on chains of AND gates (or OR gates) in $S'$ to move less critical gates into $S \setminus S'$, respecting physical locations by modifying arrival times as above. This way, $S'$ becomes an AND-OR path that, with input arrival times $a'$, can be passed to Algorithm 2. Figure 4 depicts the normalization on a path $S$ (left) containing inverters (bubbles), NOR, and OAI gates. On the right, we show $S$ after normalization with the AND-OR path $S'$ colored.

B. Technology Mapping

The purpose of our technology mapping step is to change the newly created circuit locally to improve worst slack and the physical area occupied by gates by making use of all gates available on the design. We use the dynamic programming algorithm from Elbert [6], which covers the input circuit by graphs representing the available gate types. With respect to any fixed tradeoff of arrival time (regarding our timing model from Section III-A, but with specific estimated delays per gate type) and number of gates, this algorithm computes an optimum technology mapping, but the running time grows exponentially in the number $l$ of gates with more than one successor. In our application, $l$ is usually very small, hence we can afford this running time (cf. the end of Section IV). For constant $l$, [6] also provides a fully polynomial-time approximation scheme. On general circuits, computing a size-optimum technology mapping is NP-hard [12].

C. Detailed Optimization

Depending on the actual stage of the design, our detailed optimization step invokes buffering, layer assignment and gate sizing tools. When used in late physical design, we apply Held's gate sizing routine [9], followed by the buffering tool with an integrated layer assignment by Bartoschek et al. [2]. After buffering, we apply gate sizing again, in particular on newly inserted buffers. As we work on the most critical fraction of the design, $V_t$ assignment can be done conveniently by using the fastest gates available.

An incremental placement legalization makes sure that the placement remains legal throughout all netlist changes.

IV. EXPERIMENTAL RESULTS

In a first set of experiments, we examined the AND-OR path optimization algorithm from Section II separately. To this end, we created AND-OR path instances with 4 to 28 inputs and random integral arrival times chosen uniformly from the interval [0, #inputs]. For each number of inputs, we created 1000 instances.

We compared our results with the previously best methods [3], [10], and [20]. For each instance, we ran all three algorithms and compared the best result in terms of delay to our algorithm's output. Figure 5 visualizes our results. Instances are grouped by their numbers of inputs, and colors indicate the absolute delay difference of computed solutions. Our algorithm covers all recursion options from [3], [10], [20], so our solutions can never be worse. In fact, on almost all instances, the delay of our circuit is better, and already for […] inputs, on every other instance better by […] or more.

Fig. 5: Delay gain of the solutions computed by Algorithm 2 compared to the best solution among [3], [10], [20] on instances with random integral input arrival times.

For each instance, we computed a lower bound on delay based on the following ideas: First, Kraft's inequality [15] imposes a lower bound on the delay of any binary circuit; secondly, we enumerate possible local gate configurations near the output of an AND-OR path circuit $C$ and recursively compute lower bounds for sub-circuits. We compared our delay to the resulting lower bound. Among all our solutions, 89 % achieve the lower bound and hence are provably delay-optimum, and only 0.012 % exceed the lower bound by […].

Figure 6 compares our realization with [10] on an example instance. In our circuit, the splits (6)*, (7) and (7)* were applied, and the ability to optimize undetermined circuits was used twice. This way, our delay of […] is better than the delay found by [10], and it is even optimum since the input with arrival time […] has to traverse at least […] gates in any solution. On this instance, we need one more gate than [10]. In general, the number of gates used by our algorithm (with our modification for size reduction) is typically higher than in [3], [10], [20], but mostly in the range of 20 %.

Fig. 6: Three logically equivalent AND-OR path circuits. The circuit on the left has delay […] and size […], the circuit in the middle computed by [10] delay […] and size […], and our circuit on the right delay […] and size […]. In our circuit, we indicate the splits used by Algorithm 2: (7)* with $\lambda = 0$ (undetermined), (7) with $\lambda = 0$ (undetermined), (7) with $\lambda = 2$, and (6)* with $\lambda = 1$.

In a second set of experiments, we examined our logic optimization framework as a whole. Table I shows results on recent 7nm pre-routing designs using the RICE delay model. The 'init' row displays the state of the chips as in our application in industry: a timing-driven placement has been computed, followed by various timing optimization steps, among those our buffering and gate sizing sub-routines. The initial netlist cannot be improved any further by classical timing optimization. The 'LO' row shows results after applying our logic optimization flow to this netlist. We see that worst slack (WS) and the total sum of negative slacks (TS) mostly improve significantly during logic optimization. This does not disrupt global objectives such as area, number of gates, netlength, and routability, which barely change. To check routability, we use the ACE5 estimate from [23], the average congestion of the 5 % most congested resources, weighted by usage, computed by the global router from [18].

TABLE I: Performance of our logic restructuring framework on 7nm real-world instances.

Unit  Run   WS [ps]  TS [ns]   […]         […]         […]    T
i1    init  201      15.[…]
i1    LO    188      15.[…]    −[…].02 %   0.00 %      86 %   12
i2    init  62       52.[…]
i2    LO    58       52.[…]    […].02 %    +0.04 %     96 %   11
i3    init  109      192.[…]
i3    LO    93       189.[…]   […].01 %    0.00 %      […]    […]
i4    init  […]      […]
i4    LO    […]      […]       −[…].06 %   −[…].07 %   99 %   59
i5    init  159      345.[…]
i5    LO    152      343.[…]   […].02 %    0.00 %      94 %   287
i6    init  34       13.[…]
i6    LO    20       8.[…]     […].00 %    +0.01 %     88 %   228
i7    init  92       251.[…]
i7    LO    77       230.[…]   […].03 %    +0.06 %     95 %   525
i8    init  136      850.[…]
i8    LO    120      833.[…]   […].01 %    +0.02 %     90 %   249

Our program was implemented in C++, and all tests were executed on a machine with two Intel(R) Xeon(R) CPU E5-2667 v2 processors, using a single thread. In the last column (T), we show the total running time of our flow, which is largely dominated by gate sizing because it performs many expensive queries to the timing engine. On any design, the total running time of all calls to Algorithm 2 is less than […] second, and less than […] seconds for the whole path restructuring step. Per design, we consider roughly 1500 AND-OR path restructuring instances with up to […] inputs.

V. CONCLUSION

We presented a new approximation algorithm for delay optimization of AND-OR paths and a logic optimization framework using this algorithm to improve critical paths in late physical design. With respect to a simple but realistic delay model, our algorithm fulfills the best known mathematical guarantees, outperforms the previously best approaches, and is often optimum. Results on industrial 7nm designs demonstrate that our logic optimization framework improves timing when traditional timing optimization tools have reached their limits.

REFERENCES

[1] L. Amarú, M. Soeken, P. Vuillod, J. Luo, A. Mishchenko, P.-E. Gaillardon, J. Olson, R. Brayton, and G. De Micheli. Enabling exact delay synthesis. ICCAD, pages 352–359, 2017.
[2] C. Bartoschek, S. Held, D. Rautenbach, and J. Vygen. Fast buffering for optimizing worst slack and resource consumption in repeater trees. ISPD, pages 43–50, 2009.
[3] U. Brenner and A. Hermann. Faster carry bit computation for adder circuits with prescribed arrival times. TALG, 15(4):45:1–45:23, 2019.
[4] R. P. Brent and H.-T. Kung. A regular layout for parallel adders. Trans. Comput., 31(3):260–264, 1982.
[5] J. Cortadella. Timing-driven logic bi-decomposition. TCAD, 22(6):675–685, 2003.
[6] L. Elbert. Approximationsalgorithmen im Technology Mapping. Bachelor's thesis, University of Bonn, 2017. German.
[7] M. C. Golumbic. Combinatorial merging. Trans. Comput., 25(11):1164–1167, 1976.
[8] M. I. Grinchuk. Sharpening an upper bound on the adder and comparator depths. J. Appl. Ind. Math., 3(1):61–67, 2009.
[9] S. Held. Gate sizing for large cell-based designs. DATE, pages 827–832, 2009.
[10] S. Held and S. Spirkl. Fast prefix adders for non-uniform input arrival times. Algorithmica, 77(1):287–308, 2017.
[11] D. A. Huffman. A method for the construction of minimum-redundancy codes. Proc. Inst. Radio Eng., 40(9):1098–1101, 1952.
[12] K. Keutzer and D. Richards. Computational complexity of logic synthesis and optimization. IWLS, 1989.
[13] V. M. Khrapchenko. Asymptotic estimation of the addition time of a parallel adder. Systems Theory Research, 19:105–122, 1970.
[14] P. M. Kogge and H. S. Stone. A parallel algorithm for the efficient solution of a general class of recurrence equations. Trans. Comput., 100(8):786–793, 1973.
[15] L. G. Kraft. A Device for Quantizing, Grouping, and Coding Amplitude-Modulated Pulses. PhD thesis, MIT, 1949.
[16] J. Liu, S. Zhou, H. Zhu, and C.-K. Cheng. An algorithmic approach for generic parallel adders. ICCAD, pages 734–740, 2003.
[17] A. Mishchenko, R. Brayton, S. Jang, and V. Kravets. Delay optimization using sop balancing. ICCAD, pages 375–382, 2011.
[18] D. Müller, K. Radke, and J. Vygen. Faster min–max resource sharing in theory and practice. MPC, 3(1):1–35, 2011.
[19] S. M. Plaza, I. L. Markov, and V. Bertacco. Optimizing non-monotonic interconnect using functional simulation and logic restructuring. ISPD, pages 92–102, 2008.
[20] D. Rautenbach, C. Szegedy, and J. Werber. Delay optimization of linear depth Boolean circuits with prescribed input arrival times. JDA, 4(4):526–537, 2006.
[21] S. Roy, M. Choudhury, R. Puri, and D. Z. Pan. Towards optimal performance-area trade-off in adders by synthesis of parallel prefix structures. TCAD, pages 1517–1530, 2014.
[22] L. Stok, D. Kung, D. Brand, A. D. Drumm, A. J. Sullivan, L. Reddy, N. Hieter, D. J. Geiger, H. H. Chao, and P. J. Osler. BooleDozer: Logic synthesis for ASICs. IBM Journal of Research and Development, 40:407–430, 1996.
[23] Y. Wei, C. Sze, N. Viswanathan, Z. Li, C. J. Alpert, L. Reddy, A. D. Huber, G. E. Tellez, D. Keller, and S. S. Sapatnekar. Glare: Global and local wiring aware routability evaluation. DAC, pages 768–773, 2012.
[24] J. Werber, D. Rautenbach, and C. Szegedy. Timing optimization by restructuring long combinatorial paths. ICCAD, pages 536–543, 2007.
[25] W.-C. Yeh and C.-W. Jen. Generalized earliest-first fast addition algorithm.