Efficient Parallel Verification of Galois Field Multipliers
arXiv [cs.SC]
Cunxi Yu, Maciej Ciesielski
ECE Department, University of Massachusetts, Amherst
[email protected], [email protected]
Abstract - Galois field (GF) arithmetic is used to implement critical arithmetic components in communication and security-related hardware, and verification of such components is of prime importance. Current techniques for formally verifying such components are based on computer algebra methods that proved successful in verification of integer arithmetic circuits. However, these methods are sequential in nature and do not offer any parallelism. This paper presents an algebraic functional verification technique for gate-level $GF(2^m)$ multipliers, in which verification is performed in bit-parallel fashion. The method is based on extracting a unique polynomial in Galois field for each output bit independently. We demonstrate that this method is able to verify an $n$-bit GF multiplier in $n$ threads. Experiments performed on pre- and post-synthesized Mastrovito and Montgomery multipliers show high efficiency up to 571 bits.

Keywords - Formal verification; Galois field arithmetic circuits; computer algebra.
I. INTRODUCTION
Galois field (GF) arithmetic is used to implement critical arithmetic components in communication and security-related hardware. It has been extensively applied in many digital signal processing and security applications, such as Elliptic Curve Cryptography (ECC), Advanced Encryption Standard (AES), and others. Multiplication is one of the most heavily used Galois field computations and it is a high-complexity operation. Specifically, in cryptography systems, the size of Galois field circuits can be very large. Therefore, developing general formal analysis techniques for Galois field arithmetic HW/SW implementations becomes critical. Contemporary formal techniques, such as Binary Decision Diagrams (BDDs), Boolean Satisfiability (SAT), Satisfiability Modulo Theories (SMT), etc., are not directly applicable to either the verification or reverse engineering of Galois field arithmetic. The limitations of these techniques when applied to Galois field arithmetic have been addressed in [1].

The most successful techniques for verifying arithmetic circuits use computer algebra techniques with polynomial representations [1][2][3][4]. The verification problem is typically formulated as proving that the implementation satisfies the specification. This is accomplished by performing a series of divisions of the specification polynomial $F$ by the implementation polynomials $B = \{f_1, \ldots, f_s\}$, representing components that implement the circuit. The technique based on Gröbner basis demonstrated that this approach can efficiently reduce the complexity of the verification problem to membership testing of the specification polynomial in the ideals [1][3]. This technique has been applied successfully to large Galois field arithmetic circuits [1]. Symbolic computer algebra methods have been used to derive word-level operations for GF circuits and integer arithmetic circuits to improve the verification performance [5][6]. A different approach to arithmetic verification of synthesized gate-level circuits has been proposed in [4]. This method uses algebraic rewriting of the polynomials at the primary outputs to extract the specification as a polynomial at the primary inputs.

However, a common limitation of all these works is that they are not applicable to parallel verification. This is because the verification problem based on computer algebra techniques expresses the specification as a polynomial in all output bits. In this approach, the polynomial division can be done only in a single thread. In principle, multiple specifications (called output signatures in [4]) can be generated by splitting the output signature. However, we examined this method and found its performance to be very poor.
The reason is that the technique of [4] needs to rewrite the entire output signature in all the output bits to benefit from large monomial cancellations during rewriting. In other works [1][5], the verification problem of post-synthesized Galois field multipliers has not been addressed.

In this work, we extend the verification technique of [4] to verification of Galois field multipliers, while applying bit-level parallelism. Specifically:

• We propose an algorithm for Galois field arithmetic verification, which significantly reduces the internal expression size during algebraic rewriting.
• We evaluate our approach using benchmarks used in [1][5], including Mastrovito and Montgomery multipliers, up to 571 bits. The results show that the efficiency of our approach surpasses that of [1] and [5].
• We demonstrate that the verification of an $n$-bit Galois field multiplier can ideally be accomplished in $n$ parallel threads. In this work, we set the number of threads to 5, 10, 20, and 30. We also analyze the efficiency of our parallel approach by studying the tradeoff between CPU runtime and memory usage.
• We address the verification of synthesized Galois field multipliers, while previous work dealt only with the verification of structural representations (arithmetic netlists) prior to synthesis.

II. BACKGROUND
Different variants of canonical, graph-based representations have been proposed for arithmetic circuit verification, including Binary Decision Diagrams (BDDs) [7], Binary Moment Diagrams (BMDs) [8], Taylor Expansion Diagrams (TED) [9], and other hybrid diagrams. While the canonical diagrams have been used extensively in logic synthesis, high-level synthesis and verification, their application to verifying large arithmetic circuits remains limited by the prohibitively high memory requirement for complex arithmetic circuits [4][1].

Alternatively, arithmetic verification problems can be modeled and solved using Boolean satisfiability (SAT) or satisfiability modulo theories (SMT). However, it has been demonstrated that these techniques cannot efficiently solve the verification problem of large arithmetic circuits [1][10]. Another class of solvers includes theorem provers, deductive systems for proving that an implementation satisfies the specification using mathematical reasoning. However, this technique requires manual guidance, which makes it difficult to apply automatically.
A. Computer Algebra Approaches
The most advanced techniques that have potential to solve the arithmetic verification problems are those based on symbolic computer algebra. These methods model the arithmetic circuit specification and its hardware implementation as polynomials [1][2][4][5][11]. The verification goal is to prove that the implementation satisfies the specification by performing a series of divisions of the specification polynomial $F$ by the implementation polynomials $B = \{f_1, \ldots, f_s\}$, representing components that implement the circuit. The polynomials $f_1, \ldots, f_s$ are called the bases, or generators, of the ideal $J$. Given a set $f_1, \ldots, f_s$ of generators of $J$, the set of all simultaneous solutions to the system of equations $f_1(x_1, \ldots, x_n) = 0, \ldots, f_s(x_1, \ldots, x_n) = 0$ is called a variety $V(J)$. The verification problem is then formulated as testing if the specification $F$ vanishes on $V(J)$. In some cases, the test can be simplified to checking if $F \in J$, which is known in computer algebra as ideal membership testing [1].

There are two basic techniques to reduce polynomial $F$ modulo $B$. A standard procedure to test if $F \in J$ is to divide polynomial $F$ by the elements of $B$: $f_1, \ldots, f_s$, one by one. The goal is to cancel, at each iteration, the leading term of $F$ using one of the leading terms of $f_1, \ldots, f_s$. If the remainder of the division is $r = 0$, then $F$ vanishes on $V(J)$, proving that the implementation satisfies the specification. However, if $r \neq 0$, such a conclusion cannot be made: $B$ may not be sufficient to reduce $F$ to 0, and yet the circuit may be correct. To check if $F$ is reducible to zero, a canonical set of generators, $G = \{g_1, \ldots, g_t\}$, called a Gröbner basis, is needed. This technique has been successfully applied to Galois field arithmetic [1] and integer arithmetic circuits [3].

A different approach has been proposed in [4][12][6][13][14], where a gate-level network is described by a system of equations and proved by backward rewriting. Starting with the known output signature (polynomial) in primary output variables, it rewrites the signature from the primary outputs to the primary inputs, to extract an arithmetic function (specification).

Verification work specific to Galois field arithmetic has been presented in [1][5]. These works provide significant improvement compared to other techniques, since their formulations rely on certain simplifying properties of Galois fields during polynomial reductions. Specifically, the problem reduces to ideal membership testing over a larger ideal that includes $J_0 = \langle x^2 - x \rangle$ in $\mathbb{F}_2$. In this paper, we provide a comparison between this technique and our approach.

B. Galois Field Multiplication
Galois field (GF) is a number system with a finite number of elements and two main arithmetic operations, addition and multiplication; other operations can be derived from those two [15]. A Galois field with $p$ elements is denoted as $GF(p)$. The most widely used finite fields are prime fields and extension fields, and particularly binary extension fields. A prime field, denoted $GF(p)$, is a finite field consisting of a finite number of integers $\{0, 1, \ldots, p-1\}$, where $p$ is a prime number, with addition and multiplication performed modulo $p$. A binary extension field, denoted $GF(2^m)$ (or $\mathbb{F}_{2^m}$), is a finite field with $2^m$ elements; unlike in prime fields, however, the operations in extension fields are not computed modulo $2^m$. Instead, in one possible representation (called polynomial basis), each element of $GF(2^m)$ is a polynomial with $m$ terms and coefficients in $GF(2)$. Addition of field elements is the usual addition of polynomials, with coefficient arithmetic performed modulo 2. For example, a 2-bit vector $A = \{a_0, a_1\}$ in $GF(2^2)$ is $A(x) = a_0 + a_1 x$, where $a_i \in GF(2) = \{0, 1\}$. Multiplication of field elements is performed modulo an irreducible polynomial $P(x)$ of degree $m$ with coefficients in $GF(2)$. For example, $P(x) = x^2 + x + 1$ is an irreducible polynomial in $GF(2^2)$. The irreducible polynomial $P(x)$ is analogous to the prime number $p$ in prime fields $GF(p)$. Extension fields are used in many cryptography applications, such as AES and ECC. In this work, we focus on the verification problem of $GF(2^m)$ multipliers.

[Figure 1: partial-product arrays for the two 2-bit multiplications.]
Fig. 1: 2-bit multiplication: a) standard integer multiplication with 4-bit result; b) multiplication in $GF(2^2)$ with $A(x) = a_0 + a_1 x$, $B(x) = b_0 + b_1 x$ and result $Z(x) = z_0 + z_1 x \equiv A(x) \cdot B(x) \bmod P(x)$; irreducible polynomial $P(x) = x^2 + x + 1$.

An example of multiplication in $GF(2^2)$ is shown in Figure 1. The left part of the figure shows a standard 2-bit integer multiplication with four output bits. To represent the result in $GF(2^2)$, which can contain only two bits, the bits $r_2$ and $r_3$ are reduced in $GF(2^2)$. The result of such a reduction is shown in the right part of the figure. The input and output operands are represented using polynomials $A(x)$, $B(x)$ and $Z(x)$. The functions of $s_0$, $s_1$ and $s_2$ are represented using polynomials in $GF(2)$: $s_0 = a_0 b_0$, $s_1 = a_0 b_1 + a_1 b_0$, and $s_2 = a_1 b_1$. Since $x^2 \equiv x + 1 \pmod{P(x)}$, the term $s_2 x^2$ contributes to both output bits. Hence, $z_0 = s_0 + s_2$ and $z_1 = s_1 + s_2$. As a result, the coefficients of the multiplication are: $z_0 = a_0 b_0 + a_1 b_1$, $z_1 = a_0 b_1 + a_1 b_0 + a_1 b_1$. In digital circuits, partial products can be implemented using AND gates, and addition modulo 2 using XOR gates. Note that, unlike in integer multiplication, in $GF(2^m)$ circuits there is no carry out to the next bit. For this reason, as we can see in Figure 1, the function of each output bit is computed independently of other bits.

C. Function Extraction

Function extraction is an arithmetic verification method proposed in [4] for integer arithmetic circuits, in $\mathbb{Z}_{2^m}$. It extracts a unique bit-level polynomial function implemented by the circuit directly from its gate-level implementation. Extraction is done by backward rewriting, i.e., transforming the polynomial representing the encoding of the primary outputs (called the output signature) into a polynomial at the primary inputs (the input signature). This technique has been successfully applied to large integer arithmetic circuits, such as 512-bit integer multipliers. However, it cannot be directly applied to large GF multipliers because the number of intermediate polynomial terms before cancellations during rewriting grows exponentially. Fortunately, arithmetic $GF(2^m)$ circuits offer an inherent parallelism which can be exploited in backward rewriting. In the rest of the paper, we show how to apply such parallel rewriting in $GF(2^m)$ circuits while avoiding the memory explosion experienced in integer arithmetic circuits.

III. PRELIMINARIES
A. Computer Algebraic Model

The circuit is modeled as a network of logic elements of arbitrary complexity including: basic logic gates (AND, OR, XOR, INV) and complex standard cell gates (AOI, OAI, etc.) obtained by synthesis and technology mapping. Instead of modeling Boolean operators using pseudo-Boolean equations, we use the algebraic models in $GF(2)$, i.e., modulo 2. For example, the pseudo-Boolean model of XOR$(a, b) = a + b - 2ab$ is reduced to $(a + b - 2ab) \bmod 2 = (a + b) \bmod 2$. The following algebraic equations are used to describe basic logic gates in $GF(2^m)$, according to [1] (for polynomials in $GF(2)$, "+" is computed modulo 2):

$$
\begin{aligned}
\neg a &= 1 + a \\
a \wedge b &= a \cdot b \\
a \vee b &= a + b + a \cdot b \\
a \oplus b &= a + b
\end{aligned}
\tag{1}
$$

B. Outline of the Approach

Similarly to the work of [4], the computed function of the circuit is specified by two polynomials. The output signature of a $GF(2^m)$ multiplier is $Sig_{out} = \sum_{i=0}^{m-1} z_i x^i$, with $z_i \in GF(2)$. The input signature of a $GF(2^m)$ multiplier is $Sig_{in} = \sum_{i=0}^{m-1} P_i x^i$, with coefficients $P_i \in GF(2)$ being sums of product terms, with addition performed modulo 2 (e.g., $(a_0 b_0 + a_1 b_1) \bmod 2$). For a $GF(2^m)$ multiplier, if the irreducible polynomial $P(x)$ is provided, $Sig_{in}$ is known. Our goal is to transform the output signature, $Sig_{out}$, using the polynomial representation of the internal logic elements, into the input signature $Sig_{in}$ in $GF(2^m)$. The goal of the verification problem is then to check if $Sig_{in} = Sig_{out}$, with both expressed in the primary inputs.
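The statement that $Sig_{in}$ is known once $P(x)$ is given can be illustrated with a short sketch. The Python code below is not the tool described in this paper; it is a minimal illustration, and the names `reduction_table`, `input_signature`, and `show`, as well as the encoding of $P(x)$ as a tuple of exponents, are assumptions made for this example. It reduces every power $x^{j+k}$ modulo $P(x)$ and collects the product terms $a_j b_k$ that contribute to each coefficient of $Sig_{in}$.

```python
from itertools import product


def reduction_table(m, p_exponents):
    """For each power x^k, 0 <= k <= 2m-2, return the set of exponents < m
    that x^k reduces to modulo P(x) over GF(2).
    p_exponents lists the exponents of P(x), e.g. (2, 1, 0) for x^2 + x + 1."""
    assert max(p_exponents) == m
    tail = frozenset(e for e in p_exponents if e != m)  # x^m = tail (mod P, mod 2)
    table = [frozenset([k]) for k in range(m)]
    for k in range(m, 2 * m - 1):
        # x^k = x * x^(k-1): shift the previous residue, then replace x^m by tail.
        shifted = {e + 1 for e in table[k - 1]}
        if m in shifted:
            shifted.remove(m)
            shifted ^= tail  # symmetric difference is addition modulo 2
        table.append(frozenset(shifted))
    return table


def input_signature(m, p_exponents):
    """Return Sig_in as a list: entry i is the set of pairs (j, k), each
    standing for the product a_j * b_k, whose sum modulo 2 is the
    coefficient of x^i."""
    table = reduction_table(m, p_exponents)
    coeffs = [set() for _ in range(m)]
    for j, k in product(range(m), repeat=2):
        for i in table[j + k]:
            coeffs[i] ^= {(j, k)}
    return coeffs


def show(coeffs):
    """Format each coefficient as a readable sum of products."""
    return [" + ".join(f"a{j}b{k}" for j, k in sorted(c)) for c in coeffs]


# 2-bit multiplier with P(x) = x^2 + x + 1.
print(show(input_signature(2, (2, 1, 0))))
```

For $P(x) = x^2 + x + 1$ this yields the coefficients $a_0 b_0 + a_1 b_1$ and $a_0 b_1 + a_1 b_0 + a_1 b_1$ derived in Section II-B. The helper accepts any polynomial of degree $m$ and does not check irreducibility; the number of product terms per coefficient it reports depends on the choice of $P(x)$.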
Theorem 1: Given a combinational $GF(2^m)$ arithmetic circuit, composed of logic gates described by the algebraic expressions of Eq. (1), the input signature $Sig_{in}$ computed by backward rewriting is unique and correctly represents the function implemented by the circuit in $GF(2^m)$.

Proof: The proof of correctness relies on the fact that each transformation step (rewriting iteration) is correct. That is, each internal signal is represented by an algebraic expression, which always evaluates to a correct value in $GF(2^m)$. This is guaranteed by the correctness of the algebraic model in Eq. (1), which can be proved easily by inspection. For example, the algebraic expression of XOR$(a, b)$ in $\mathbb{Z}_{2^m}$ is $a + b - 2ab$. When implemented in $GF(2^m)$, the coefficients in the expression must be in $GF(2)$. Hence, XOR$(a, b)$ in $GF(2^m)$ is represented by $a + b$. The proof of uniqueness is done by induction on $i$, the step of transforming polynomial $F_i$ into $F_{i+1}$. A detailed induction proof for expressions in $\mathbb{Z}_{2^m}$ is provided in [4]. $\square$

Theorem 2: Let the number of logic elements (polynomials) in a $GF(2^m)$ multiplier be $n$. Over its iterations, the backward rewriting process generates $n$ internal expressions, $F_0, F_1, \ldots, F_{n-1}$, such that every expression $F_i \in GF(2^m)$.

Proof: Assuming that $F_0 = Sig_{out}$ and each $F_i \in GF(2^m)$, we prove that $F_{i+1} \in GF(2^m)$. Each variable in $F_i$ represents the output of some logic gate. During the rewriting process, this variable is substituted by the corresponding polynomial in Eq. (1). According to Theorem 1, the resulting polynomial $F_{i+1}$ correctly represents the function, so $F_{i+1} \in GF(2^m)$. $\square$

Theorems 1 and 2, together with the algebraic model in Eq. (1), provide the basis for polynomial reduction in backward rewriting in this work. This is described by Algorithm 1. Our method takes the gate-level netlist of a $GF(2^m)$ multiplier as input and first converts each logic gate into equations using Eq. (1). The output signature $Sig_{out}$ is required to initialize the backward rewriting. The rewriting process starts with $F_0 = Sig_{out}$, and ends when all the variables in $F_i$ are primary inputs. This is done by rewriting the polynomials representing logic elements in the netlist in a topological order [4]. Each iteration includes two steps: Step 1) substitute the variable of the gate output using the expression in the inputs of the gate (Eq. (1)), and name the new expression $F_{i+1}$ (lines 3 - 6); Step 2) simplify the new expression by removing all the monomials (including constants) that evaluate to 0 in $GF(2)$ (line 3 and lines 7 - 10). The algorithm outputs the function of the design in $GF(2^m)$ after $n$ iterations, where $n$ is the number of gates in the netlist. The final expression $F_n$ can be used for functional verification, by checking if it matches the expected input signature (if provided).

Algorithm 1: Backward Rewriting in $GF(2^m)$
Input: Gate-level netlist of a $GF(2^m)$ multiplier
Input: Output signature $Sig_{out}$, and (optionally) input signature $Sig_{in}$
Output: GF function of the design, and answer whether $Sig_{out} == Sig_{in}$
 1: $P = \{p_0, p_1, \ldots, p_n\}$: polynomials representing the gate-level netlist
 2: $F_0 = Sig_{out}$
 3: for each polynomial $p_i \in P$ do
 4:   for output variable $v$ of $p_i$ in $F_i$ do
 5:     replace every variable $v$ in $F_i$ by the expression of $p_i$
 6:     $F_i \rightarrow F_{i+1}$
 7:     for each element/monomial $M$ in $F_{i+1}$ do
 8:       if the coefficient of $M$ % 2 == 0, or $M$ is a constant with $M$ % 2 == 0 then
 9:         remove $M$ from $F_{i+1}$
10:       end if
11:     end for
12:   end for
13: end for
14: return $F_n$ and $F_n$ =? $Sig_{in}$

[Figure 2: schematic with gates G1 to G8, internal signals $n$, primary inputs $a_0, a_1, b_0, b_1$, and outputs $z_0, z_1$.]
Fig. 2: The gate-level netlist of a post-synthesized and mapped 2-bit multiplier over $GF(2^2)$. The irreducible polynomial is $P(x) = x^2 + x + 1$.

Example 1 (Figure 2): We illustrate our method using a post-synthesized 2-bit multiplier in $GF(2^2)$, shown in Figure 2. The irreducible polynomial is $P(x) = x^2 + x + 1$. The output signature is $Sig_{out} = z_0 + z_1 x$, and the input signature is $Sig_{in} = (a_0 b_0 + a_1 b_1) + (a_0 b_1 + a_1 b_0 + a_1 b_1) x$. First, $F_0 = Sig_{out}$ is transformed into $F_1$ using the polynomial of gate $G_7$, $z_1 = n_p + n_q$, where $n_p$ and $n_q$ denote the two internal signals feeding $G_7$ in Figure 2. This expression is simplified to $F_1 = z_0 + n_p x + n_q x$. Then, the polynomials $F_{i+1}$ are successively derived from $F_i$ and checked for a possible reduction. The first reduction happens when $F_4$ is transformed into $F_5$, where the internal signal at the output of gate $G_4$ is replaced by its expression in the primary inputs. After simplification, a monomial $2x$ is identified and removed from $F_5$ since 2 % 2 = 0. Similar reductions are applied during the transformations $F_6 \rightarrow F_7$ and $F_7 \rightarrow F_8$. Finally, the function of the design is extracted by Algorithm 1.
A complete rewriting process is shown in Figure 3. We can see that $F_8 = Sig_{in}$, which indicates that the circuit indeed implements the $GF(2^2)$ multiplication with $P(x) = x^2 + x + 1$.

[Figure 3: step-by-step rewriting trace. Starting from $Sig_{out}$: $F_0 = z_0 + x z_1$, the gates are rewritten in the order G7, G6, G5, G8, G4, G3, G2, G1, producing $F_1, \ldots, F_8$. The eliminated terms are $2x$ at G4 ($F_5$), $2x$ at G2 ($F_7$), and the constant 2 at G1 ($F_8$). The result is $Sig_{in} = a_0 b_0 + a_1 b_1 + x(a_0 b_1 + a_1 b_0 + a_1 b_1)$.]
Fig. 3: Function extraction of the 2-bit GF multiplier shown in Figure 2 using backward rewriting from PO to PI.

An important observation is that the potential reductions take place only within the expression associated with the same degree (power of $x$) of the polynomial $Sig_{out}$. In other words, the reductions happen in the logic cone of every output bit independently of other bits, regardless of logic sharing between the cones. For example, the reductions in $F_5$ and $F_7$ are extracted from output $z_1$ only. Similarly, in $F_8$, the reduction is from $z_0$.

Theorem 3: Given a $GF(2^m)$ multiplier with $Sig_{out} = F_0 = z_0 x^0 + z_1 x^1 + \cdots + z_{m-1} x^{m-1}$, and an intermediate expression $F_i = E_0 x^0 + E_1 x^1 + \cdots + E_{m-1} x^{m-1}$, where each $E_j$ is an algebraic expression in $GF(2)$ obtained during rewriting, polynomial reduction is possible only within a single expression $E_j$, for $j = 0, 1, \ldots, m-1$.

Proof: Consider a polynomial $E_i x^{n_i} + E_k x^{n_k}$ with $n_i \neq n_k$, where $E_i$ and $E_k$ are simplified in $GF(2)$. That is, $E_i = (e_i^1 + e_i^2 + \ldots)$, and $E_k = (e_k^1 + e_k^2 + \ldots)$. After simplifying each of the two polynomials, there are no common monomials between $E_i x^{n_i}$ and $E_k x^{n_k}$. This is because $e_i^l x^{n_i} \neq e_k^j x^{n_k}$ for any pairs $(i, k)$ and $(l, j)$. $\square$

IV. IMPLEMENTATION
[Figure 4: flow diagram. The gate-level netlist is converted to equations; the output signature is split into single-bit signatures, each rewritten against the netlist equations in its own thread (thread 1, thread 2, thread 3, ..., thread m); the results are combined to compute the final function and return $F_n$.]
Fig. 4: Overview of parallel verification of GF multipliers.

This section describes the implementation of our parallel verification method for Galois field multipliers. The overview of the proposed technique is shown in Figure 4. Our approach takes the gate-level netlist as input, and outputs the extracted function of the design. It includes four steps:

1) Convert the gate-level netlist into algebraic equations. During this step, the gate-level netlist is translated into algebraic equations based on Eq. (1). The equations are levelized in topological order, to be rewritten by backward rewriting in the next step.
2) Split the output signature of the $GF(2^m)$ multiplier into $m$ polynomials with $Sig_{out_i} = z_i$. These new signatures are represented by $m$ equation files.
3) Split the function of $m$ output bits into $m$ separate functions, each to be processed by a separate thread using Algorithm 1. In contrast to the work of [4], the internal expression of each output bit does not offer any polynomial reduction (monomial cancellations) for other bits.
4) Compute the final function of the multiplier. Once the algebraic expression of each output bit in $GF(2)$ is computed, our method computes the final function by substituting the expressions obtained in step 3 into $Sig_{out}$.
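The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the C++ tool evaluated in this paper: the netlist `NETLIST`, its signal names `n1` to `n6`, and the helper names `gate_poly`, `substitute`, `rewrite_bit`, and `extract` are hypothetical, chosen so that the gate types and rewriting order resemble the 2-bit example of Figure 2. Polynomials are stored as sets of monomials, so a monomial with an even coefficient cancels automatically; representing a monomial as a set of signal names additionally assumes $v^2 = v$ for Boolean signals. Python threads are used only to show the structure of the per-bit split; they do not provide the speedup of native threads.

```python
from concurrent.futures import ThreadPoolExecutor

ONE = frozenset()  # the constant monomial 1


def gate_poly(kind, ins):
    """GF(2) model of a gate, following Eq. (1). A polynomial is a set of
    monomials; a monomial is a frozenset of signal names."""
    a, b = (list(ins) + [None])[:2]
    if kind == "INV":
        return {ONE, frozenset([a])}            # 1 + a
    if kind == "AND":
        return {frozenset([a, b])}              # a*b
    if kind == "NAND":
        return {ONE, frozenset([a, b])}         # 1 + a*b
    if kind == "XOR":
        return {frozenset([a]), frozenset([b])}  # a + b
    if kind == "OR":
        return {frozenset([a]), frozenset([b]), frozenset([a, b])}
    raise ValueError(f"unsupported gate: {kind}")


def substitute(poly, var, expr):
    """Replace var by expr in poly. Toggling monomials with ^= removes any
    monomial whose coefficient becomes even (the mod-2 simplification)."""
    out = set()
    for mono in poly:
        if var not in mono:
            out ^= {mono}
            continue
        rest = mono - {var}
        for term in expr:
            out ^= {rest | term}
    return out


def rewrite_bit(out_var, netlist):
    """Backward rewriting of the single-bit signature Sig_out_i = z_i.
    netlist is a list of (output, kind, inputs) in topological order."""
    f = {frozenset([out_var])}
    for out, kind, ins in reversed(netlist):
        f = substitute(f, out, gate_poly(kind, ins))
    return f


def extract(netlist, outputs, threads):
    """Rewrite every output bit independently, one task per bit."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(lambda z: rewrite_bit(z, netlist), outputs))


# Hypothetical 2-bit GF(2^2) multiplier netlist (P(x) = x^2 + x + 1).
NETLIST = [
    ("n1", "NAND", ["a0", "b0"]),
    ("n2", "NAND", ["a1", "b1"]),
    ("n3", "NAND", ["a0", "b1"]),
    ("n4", "NAND", ["a1", "b0"]),
    ("n5", "XOR", ["n3", "n4"]),
    ("n6", "INV", ["n2"]),
    ("z0", "XOR", ["n1", "n2"]),
    ("z1", "XOR", ["n5", "n6"]),
]

for bit, poly in zip(["z0", "z1"], extract(NETLIST, ["z0", "z1"], threads=2)):
    print(bit, "=", " + ".join(sorted("*".join(sorted(m)) for m in poly)))
```

Each call to `rewrite_bit` touches only the logic cone of its own output bit, so the per-bit results can be merged into the final signature without any further cancellation between bits.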
[Figure 5: two side-by-side rewriting traces through gates G7, G6, G5, G8, G4, G3, G2, G1, one for $Sig_{out_0} = z_0$ and one for $Sig_{out_1} = x \cdot z_1$, each with a column of eliminated terms: the constant 2 at G1 in the first trace, and $2x$ at G4 and at G2 in the second. The merged result is $Sig_{in} = a_0 b_0 + a_1 b_1 + x(a_0 b_1 + a_1 b_0 + a_1 b_1)$.]
Fig. 5: Parallel extraction of the 2-bit GF multiplier shown in Figure 2.
Example 2 (Figure 5): We illustrate our parallel extraction method using the 2-bit multiplier in $GF(2^2)$ in Figure 2. The output signature $Sig_{out} = z_0 + z_1 x$ is split into two signatures, $Sig_{out_0} = z_0$ and $Sig_{out_1} = z_1$. Then, the rewriting process is applied to $Sig_{out_0}$ and $Sig_{out_1}$ in parallel. When $Sig_{out_0}$ and $Sig_{out_1}$ have been successfully extracted, the two signatures are merged as $Sig_{out_0} + Sig_{out_1} x$, resulting in the polynomial $Sig_{in}$. In Figure 3, we can see that elimination happens three times ($F_5$, $F_7$, and $F_8$). According to Theorem 3, we know that each elimination happens within the expression of a single output bit. In Figure 5, one elimination in $Sig_{out_0}$ and two eliminations in $Sig_{out_1}$ have been done independently, matching those shown earlier (Figure 3).

V. RESULTS
The verification technique described in this paper was implemented in C++. It performs backward rewriting with variable substitution and polynomial reductions in Galois field, using the approach discussed in Sections III and IV. The program was tested on a number of combinational gate-level $GF(2^m)$ multipliers taken from [1], including Montgomery multipliers [16] and Mastrovito multipliers [17]. The bit-width of the multipliers varies from 32 to 571 bits. The experiments of verifying Galois field multipliers using SAT, SMT, ABC [18] and Singular [19] have been presented in [1] and [5]. They show that the rewriting technique performs significantly better than other techniques. Hence, in this work, we only compare our approach to those of [1] and [5]. Specifically, we compare our approach to the tool described in [5] on the same benchmark set. Our experiments were conducted on a PC with Intel(R) Xeon CPU E5-2420 v2 2.20 GHz x12 with 32 GB memory. As described in the next section, our technique is able to verify Galois field multipliers in multiple threads (up to 30 using our platform). In each thread, Algorithm 1 is applied on a single output bit. The number of threads is given as input to the tool.

A. Evaluation of Our Approach
The experimental results of our approach and the comparison against [5] are shown in Table I for gate-level Mastrovito multipliers with bit-width varying from 32 to 571 bits. These multipliers are directly mapped using ABC [18] without any optimization. The largest circuit includes more than 1.6 million gates. This is also the number of polynomial equations and the number of rewriting iterations (see Section III). The results generated by the tool presented in [5] are shown in columns 3 and 4. We performed four different series of experiments, with a different number of threads, $T$ = 5, 10, 20, and 30. The runtime results are shown in columns 5 to 8 and memory usage in column 9. The timeout limit (TO) was set to 12 hours and the memory limit (MO) to 32 GB. The experimental results show that our approach provides on average 26.2x, 37.8x, 42.7x, and 44.3x speedup, for $T$ = 5, 10, 20, and 30 threads, respectively. Our approach can verify multipliers up to 571 bits wide in 1.5 hours, while that of [5] times out after 12 hours.

Mastrovito |            |        [5]        |     This work: runtime (s)     | Mem
Op size    | # eqns     | Runtime (s) | Mem | T=5   | T=10  | T=20  | T=30   | T=1*
-----------+------------+-------------+-----+-------+-------+-------+--------+-------
32         | 5,482      | 0.83        | 3   | 1.90  | 1.54  | 0.95  | 1.09   | 10 MB
48         | 12,228     | 8.39        | 13  | 5.73  | 3.36  | 2.83  | 2.27   | 21 MB
64         | 21,814     | 28.90       | 21  | 11.08 | 7.88  | 6.87  | 6.74   | 37 MB
96         | 51,412     | 195.2       | 45  | 38.14 | 26.69 | 20.19 | 22.66  | 84 MB
128        | 93,996     | 924.3       | 91  | 91.67 | 62.68 | 54.99 | 56.76  | 152 MB
163        | 153,245    | 3546        | 161 | 192.6 | 137.5 | 120.7 | 113.1  | 248 MB
233        | 167,803    | 4933        | 168 | 294.1 | 212.7 | 180.1 | 170.6  | 270 MB
283        | 399,688    | 30358       | 380 | 890.7 | 606.5 | 549.7 | 529.8  | 642 MB
571        | 1,628,170  | TO          | -   | 7980  | 5038  | MO    | MO     |

TABLE I: Results of verifying Mastrovito multipliers using our parallel approach. $T$ is the number of threads. TO = time out of 12 hours. MO = memory out of 32 GB. (* T=1 shows the maximum memory usage of each thread.)

Montgomery |            |        [5]        |      This work: runtime (s)       | Mem
Op size    | # eqns     | Runtime (s) | Mem | T=5    | T=10   | T=20   | T=30   | T=1*
-----------+------------+-------------+-----+--------+--------+--------+--------+-------
32         | 4,352      | 1.98        | 3   | 3.49   | 2.16   | 1.31   | 2.08   | 8 MB
48         | 9,602      | 14.19       | 13  | 17.71  | 10.67  | 9.16   | 6.01   | 16 MB
64         | 16,898     | 63.48       | 21  | 44.86  | 30.57  | 28.3   | 27.22  | 27 MB
96         | 37,634     | 554.6       | 45  | 234.3  | 157.8  | 133.1  | 142.3  | 59 MB
128        | 66,562     | 1924        | 68  | 208.9  | 121.3  | 115.8  | 110.4  | 95 MB
163        | 107,582    | 12063       | 101 | 1615.7 | 1172.3 | 1094.9 | 1008.1 | 161 MB
233        | 219,022    | TO          | 168 | 722.3  | 564.8  | 457.7  | 479.8  | 301 MB
283        | 322,622    | TO          | 380 | 19745  | 17640  | 15300  | 14820  | 488 MB

TABLE II: Results of verifying Montgomery multipliers using our parallel approach. $T$ is the number of threads. TO = time out of 12 hours. MO = memory out of 32 GB. (* T=1 shows the maximum memory usage of each thread.)

Note that the reported memory usage of our approach is the maximum memory usage per thread. This means that our tool experiences maximum memory usage with all $T$ threads running in the process; in this case, the memory usage is $T \cdot Mem$. This is why the 571-bit Mastrovito multiplier could be successfully verified with $T$ = 5 and 10, but failed with $T$ = 20 and 30 threads. For example, the peak memory usage of the 571-bit Mastrovito multiplier with $T = 20$ is $2.6 \times 20 = 52$ GB, which exceeds the available memory limit.

We also tested Montgomery multipliers with bit-width varying from 32 to 283 bits. These experiments are different from those in [5]. In our work, we first flatten the Montgomery multipliers before applying our verification technique. That is, we assume that only the positions of the primary inputs and outputs are known, without the knowledge of the internal structure or clear boundaries of the blocks inside the design. The results are shown in Table II. For 32- to 163-bit Montgomery multipliers, our approach provides on average a 9.2x, 15.9x, 16.6x, and 17.4x speedup, for $T$ = 5, 10, 20, and 30, respectively. Notice that [5] cannot verify the flattened Montgomery multipliers of 233 bits and larger in 12 hours.

In Table II, we observe that the CPU runtime for verifying a 163-bit multiplier is greater than that of a 233-bit multiplier. This is because the computation complexity depends not only on the bit-width of the multipliers, but also on the irreducible polynomial $P(x)$.

To analyze this dependency, we studied the effects of $P(x)$ on 4-bit multiplications implemented using different irreducible polynomials. The results are reported in Figure 6. We can see that when $P_1(x) = x^4 + x^3 + 1$, the longest logic paths, for $z_3$ and $z_0$, include ten and seven products that need to be generated using XORs, respectively. However, when $P_2(x) = x^4 + x + 1$, the two longest paths, $z_1$ and $z_2$, have only seven and six products. This means that the $GF(2^4)$ multiplication requires more XOR operations using $P_1(x)$ than using $P_2(x)$. In other words, the gate-level implementation of the multiplier implemented using $P_1(x)$ has more gates compared to $P_2(x)$. In conclusion, we can see that the irreducible polynomial $P(x)$ has significant impact on both the design cost and the verification cost of $GF(2^m)$ multipliers.

[Figure 6: partial-product arrays of the 4-bit multiplication and the reduction of $s_0, \ldots, s_6$ to $z_0, \ldots, z_3$, shown for $P_1(x) = x^4 + x^3 + 1$ and for $P_2(x) = x^4 + x + 1$.]
Fig. 6: Analysis of the computation complexity of Galois field multipliers with different irreducible polynomials using two 4-bit GF multiplications, which are implemented using $x^4 + x^3 + 1$ and $x^4 + x + 1$.

B. Runtime and Memory Tradeoff
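As a rough illustration of the memory side of this tradeoff, the sketch below applies the $T \cdot Mem$ estimate from Section V-A to pick the largest thread count that stays within a memory limit. The function name `max_threads` and its parameters are hypothetical and not part of the tool; the per-thread figure of 2.6 GB for the 571-bit Mastrovito multiplier is an assumption inferred from the 52 GB total reported for $T = 20$.

```python
def max_threads(mem_per_thread_gb, mem_limit_gb, candidates=(5, 10, 20, 30)):
    """Return the largest candidate thread count T such that
    T * mem_per_thread_gb fits in mem_limit_gb, or None if none fits."""
    feasible = [t for t in candidates if t * mem_per_thread_gb <= mem_limit_gb]
    return max(feasible) if feasible else None


# 571-bit Mastrovito multiplier: about 2.6 GB per thread (assumed), 32 GB limit.
print(max_threads(2.6, 32))

# 283-bit Mastrovito multiplier: 642 MB per thread (Table I), 32 GB limit.
print(max_threads(0.642, 32))
```

Under these assumptions the first call selects $T = 10$, consistent with the MO entries for $T = 20$ and $T = 30$ in Table I, while the second call allows all 30 threads. The estimate ignores any memory shared between threads.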
In this section, we discuss the tradeoff between runtime and memory usage of our approach. The plots in Figure 7 show how the average runtime and memory usage change with different numbers of threads. The vertical axis on the left is CPU runtime (in seconds), and on the right is memory usage (MB). The horizontal axis represents the number of threads $T$, ranging from 5 to 30. The runtime is significantly improved for $T$ between 5 and 15. However, there is not much speedup when $T$ is greater than 20, most likely due to the memory management synchronization overhead between the threads. Based on the experiments with Mastrovito multipliers (Table I), our approach is limited by the memory usage when the size of the multiplier and $T$ are large. In our work, $T = 20$ seems to be the best choice. The best choice of $T$ varies across platforms, depending on the number of cores and the available memory.

[Figure 7: plot of average runtime (sec, left axis) and average memory usage (MB, right axis) versus the number of threads; series: Mas-runtime, Mas-memory, Mont-runtime, Mont-memory.]
Fig. 7: Runtime and memory usage of our parallel verification approach as a function of the number of threads $T$.

C. Verification of Synthesized GF Multipliers
In [10], the authors conclude that highly bit-optimized integer arithmetic circuits are harder to verify than their original, pre-synthesized netlists. This is because the efficiency of the rewriting technique relies on the amount of cancellation between the different terms of the polynomial, and the cancellations strongly depend on the order in which signals are rewritten. A good ordering of signals is difficult to achieve in highly bit-optimized circuits.

In order to see the effect of synthesis on parallel verification of GF circuits, we applied our approach to post-synthesized Galois field multipliers with operands up to 409 bits (571-bit multipliers could not be synthesized in a reasonable time). We synthesized Mastrovito and Montgomery multipliers using the ABC tool [18]. We repeatedly used the commands resyn2 and dch until ABC could not reduce the number of levels or the number of nodes any more. The synthesized multipliers were mapped using a 14nm technology library. The verification experiments shown in Table III were performed by our tool with T = 20 threads. Our tool was able to verify both the 409-bit Mastrovito and the 409-bit Montgomery multiplier within 13 minutes.

Footnote: "dch" is the most efficient bit-optimization function in ABC.

Op size   Mastrovito             Montgomery
          Runtime    Mem         Runtime    Mem
64        4.25 s     21 MB       15.3 s     38 MB
96        10.9 s     44 MB       40.5 s     54 MB
128       28.9 s     77 MB       27.1 s     78 MB
163       62.3 s     123 MB      205.2 s    153 MB
233       134.8 s    201 MB      141.4 s    199 MB
283       168.4 s    198 MB      169.2 s    194 MB
409       775.6 s    635 MB      750.6 s    597 MB

TABLE III: Runtime and memory usage of synthesized Mastrovito and Montgomery multipliers (T = 20).

We observe that the Galois field multipliers are much easier to verify after optimization. For example, the verification of a 283-bit Montgomery multiplier takes 15,300 seconds when T = 20. After optimization, the runtime is just 169.2 seconds, which is about 90× faster than verifying the original implementation. The memory usage is also reduced from 488 MB to 194 MB. In summary, in contrast to [10], bit-level optimization actually reduces the complexity of the backward rewriting process. This is because extracting the function of an output bit of a GF multiplier depends only on the logic cone of this bit and does not require logic from other bits to be simplified (see Theorem 3). Hence, the complexity of function extraction is naturally reduced if the logic cone is minimized.

VI. CONCLUSION

In this paper, we present an algebraic functional verification technique for gate-level GF(2^m) multipliers, in which verification is performed in bit-parallel fashion. The method is based on extracting a unique polynomial in Galois field for each output bit independently. We demonstrate that this method is able to verify an n-bit GF multiplier in n threads, and we apply it to pre- and post-synthesized Mastrovito and Montgomery multipliers up to 571 bits. The results show that our parallel approach gives average speedups of 44× and 17× compared to the best existing algorithm. In addition, we analyze the effects of the irreducible polynomial and of synthesis on the verification of GF(2^m) multipliers.

Acknowledgment:
This work was supported by awards from the National Science Foundation, No. CCF-1319496 and No. CCF-1617708.

REFERENCES

[1] J. Lv, P. Kalla, and F. Enescu, “Efficient Gröbner Basis Reductions for Formal Verification of Galois Field Arithmetic Circuits,” IEEE Trans. on CAD, vol. 32, no. 9, pp. 1409–1420, September 2013.
[2] E. Pavlenko, M. Wedler, D. Stoffel, W. Kunz et al., “STABLE: A new QF-BV SMT solver for hard verification problems combining Boolean reasoning with computer algebra,” in DATE, 2011, pp. 155–160.
[3] A. Sayed-Ahmed, D. Große, U. Kühne, M. Soeken, and R. Drechsler, “Formal verification of integer multipliers by combining Gröbner basis with logic reduction,” in DATE’16, 2016, pp. 1–6.
[4] M. Ciesielski, C. Yu, W. Brown, D. Liu, and A. Rossi, “Verification of Gate-level Arithmetic Circuits by Function Extraction,” in . ACM, 2015, pp. 52–57.
[5] T. Pruss, P. Kalla, and F. Enescu, “Equivalence Verification of Large Galois Field Arithmetic Circuits using Word-Level Abstraction via Gröbner Bases,” in DAC’14, 2014, pp. 1–6.
[6] C. Yu and M. J. Ciesielski, “Automatic word-level abstraction of datapath,” in ISCAS’16, 2016, pp. 1718–1721.
[7] R. E. Bryant, “Graph-based algorithms for Boolean function manipulation,” IEEE Trans. on Computers, vol. C-35, no. 8, pp. 677–691, 1986.
[8] R. E. Bryant and Y. Chen, “Verification of arithmetic circuits with binary moment diagrams,” in Proceedings of the 32nd Conference on Design Automation, San Francisco, California, USA, Moscone Center, June 12-16, 1995, 1995, pp. 535–541.
[9] M. Ciesielski, P. Kalla, and S. Askar, “Taylor Expansion Diagrams: A Canonical Representation for Verification of Data Flow Designs,” IEEE Trans. on Computers, vol. 55, no. 9, pp. 1188–1201, Sept. 2006.
[10] C. Yu, W. Brown, D. Liu, A. Rossi, and M. J. Ciesielski, “Formal verification of arithmetic circuits using function extraction,” IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 35, no. 12, pp. 2131–2142, 2016.
[11] O. Wienand, M. Wedler, D. Stoffel, W. Kunz, and G.-M. Greuel, “An Algebraic Approach for Proving Data Correctness in Arithmetic Data Paths,” CAV, pp. 473–486, July 2008.
[12] C. Yu, W. Brown, and M. J. Ciesielski, “Verification of arithmetic datapath designs using word-level approach - A case study,” in , 2015, pp. 1862–1865.
[13] S. Ghandali, C. Yu, D. Liu, B. Walter, and M. Ciesielski, “Logic Debugging of Arithmetic Circuits,” in IEEE Computer Society Annual Symposium on VLSI (ISVLSI). IEEE, 2015, pp. 113–118.
[14] C. Yu and M. Ciesielski, “Formal Verification using Don’t-care and Vanishing Polynomials,” in . IEEE, 2016, pp. 284–289.
[15] C. Paar and J. Pelzl, Understanding cryptography: a textbook for students and practitioners. Springer Science & Business Media, 2009.
[16] Ç. K. Koç and T. Acar, “Montgomery multiplication in GF(2^k),” Designs, Codes and Cryptography, vol. 14, no. 1, pp. 57–69, 1998.
[17] B. Sunar and Ç. K. Koç, “Mastrovito multiplier for all trinomials,” Computers, IEEE Transactions on, vol. 48, no. 5, pp. 522–527, 1999.
[18] A. Mishchenko et al., “ABC: A system for sequential synthesis and verification,” , 2007.
[19] W. Decker, G.-M. Greuel, G. Pfister, and H. Schönemann, “S