IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS
Formal Analysis of Galois Field Arithmetic Circuits - Parallel Verification and Reverse Engineering
Cunxi Yu, Student Member, IEEE, and Maciej Ciesielski, Senior Member, IEEE
Abstract: Galois field (GF) arithmetic circuits find numerous applications in communications, signal processing, and security engineering. Formal verification techniques for GF circuits are scarce and limited to circuits with known bit positions of the primary inputs and outputs. They also require knowledge of the irreducible polynomial $P(x)$, which affects the final hardware implementation. This paper presents a computer algebra technique that performs verification and reverse engineering of GF($2^m$) multipliers directly from the gate-level implementation. The approach is based on extracting a unique irreducible polynomial in a parallel fashion and proceeds in three steps: 1) determine the bit position of the output bits; 2) determine the bit position of the input bits; and 3) extract the irreducible polynomial used in the design. We demonstrate that this method is able to reverse engineer GF($2^m$) multipliers in $m$ threads. Experiments performed on synthesized Mastrovito and Montgomery multipliers with different $P(x)$, including NIST-recommended polynomials, demonstrate high efficiency of the proposed method.

Index Terms: Galois field arithmetic, computer algebra, formal verification, reverse engineering, parallelism.
I. INTRODUCTION

Despite considerable progress in verification of random and control logic, advances in formal verification of arithmetic circuits have been lagging. This can be attributed to the difficulty of efficiently modeling arithmetic circuits and datapaths without resorting to computationally expensive Boolean methods. Contemporary formal techniques, such as Binary Decision Diagrams (BDDs), Boolean Satisfiability (SAT), Satisfiability Modulo Theories (SMT), etc., are not directly applicable to verification of integer and finite field arithmetic circuits [1] [2]. This paper concentrates on formal verification and reverse engineering of finite (Galois) field arithmetic circuits.

Galois field (GF) is a number system with a finite number of elements and two main arithmetic operations, addition and multiplication; other operations can be derived from those two [3]. GF arithmetic plays an important role in coding theory, cryptography, and their numerous applications. Therefore, developing formal techniques for hardware implementations of GF arithmetic circuits, and particularly for finite field multiplication, is essential.

The elements in field GF($2^m$) can be represented using polynomial rings. The field of size $2^m$ is constructed using an irreducible polynomial $P(x)$, which includes terms of degree $d \in [0, m]$ with coefficients in GF(2). The arithmetic operation in the field is then performed modulo $P(x)$. The choice of the irreducible polynomial has a significant impact on the hardware implementation of the GF circuit and its performance. Typically, the irreducible polynomial with a minimum number of terms gives the best performance [4], but this is not always the case.

C. Yu and M. Ciesielski are with the Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA, 01375. The related tools and benchmarks are released publicly on Github, ycunxi.github.io/Parallel Formal Analysis GaloisField. E-mail: [email protected]

Due to the rising number of threats in hardware security, analyzing finite field circuits becomes important. Computer algebra techniques with polynomial representation seem to offer the best solution for analyzing arithmetic circuits. Several works address the verification and functional abstraction problems, both in Galois field arithmetic [1] [5] [6] and integer arithmetic implementations [7] [2] [8] [9] [10]. Symbolic computer algebra methods have also been used to reverse engineer the word-level operations for GF circuits and integer arithmetic circuits to improve verification performance [11] [12] [5]. The verification problem is typically formulated as proving that the implementation satisfies the specification, and is accomplished by performing a series of divisions of the specification polynomial by the implementation polynomials. In the work of Yu et al. [11], the authors proposed an original spectral method based on analyzing the internal algebraic expressions during the rewriting procedure. Sayed-Ahmed et al. [12] introduced a reverse engineering technique in the Algebraic Combinational Equivalence Checking (ACEC) process by converting the function into canonical polynomials and using a Gröbner basis.

However, the above-mentioned algebraic techniques have several limitations. Firstly, they are restricted to implementations with a known binary encoding of the inputs and outputs.
This information is needed to generate the specification polynomial that describes the circuit functionality regarding its inputs and outputs, necessary for the polynomial reduction process (described in Section II-D). Secondly, these methods are unable to exploit parallelism (inherent in GF circuits), as they require that the polynomial division be applied iteratively in reverse-topological order [2] [9] [6]. Thirdly, the approaches applied specifically to GF($2^m$) arithmetic circuits [5] [6] require knowledge of the irreducible polynomial $P(x)$ of the circuit.

In this work, we present a formal approach to reverse engineer gate-level finite field arithmetic circuits that exploits the inherent parallelism of GF circuits. The method is based on a parallel algebraic rewriting approach [13] and applied specifically to multipliers. The objective of reverse engineering is as follows: given the netlist of a gate-level GF multiplier, extract the bit positions of input and output bits and the irreducible polynomial used in constructing the GF multiplication; then extract the specification of the design using this information. Bit position $i$ indicates the location of the bit in the binary word according to its significance (LSB vs. MSB). Our approach solves this problem by transforming the algebraic expressions of the output bits into an algebraic expression of the input bits (specification), and is done in parallel for each output bit. Specifically, it includes the following steps:

• Extract the algebraic expression of each output bit.
• Determine the bit position of the outputs.
• Determine the bit position of the inputs.
• Extract the irreducible polynomial $P(x)$.
• Extract the specification by algebraic rewriting.

We demonstrate the efficiency of our method using GF($2^m$) Mastrovito and Montgomery multipliers of up to 571-bit width in a bit-blasted format (i.e., flattened to bit-level), implemented using various irreducible polynomials.

II. BACKGROUND
A. Canonical Diagrams
Several approaches have been proposed to check an arithmetic circuit against its functional specification. Different variants of canonical, graph-based representations have been proposed, including Binary Decision Diagrams (BDDs) [14], Binary Moment Diagrams (BMDs) [15] [16], Taylor Expansion Diagrams (TED) [17], and other hybrid diagrams. While BDDs have been used extensively in logic synthesis, their application to verification of arithmetic circuits is limited by the prohibitively high memory requirement for complex arithmetic circuits, such as multipliers. BDDs are being used, along with many other methods, for local reasoning, but not as a monolithic data structure [18]. BMDs and TEDs offer a better space complexity but require word-level information of the design, which is often not available or is hard to extract from bit-level netlists. While the canonical diagrams have been used extensively in logic synthesis, high-level synthesis, and verification, their application to verifying large arithmetic circuits remains limited by the prohibitively high memory requirement of complex arithmetic circuits [2] [1].
B. SAT, ILP and SMT Solvers
Arithmetic verification problems have been typically modeled using Boolean satisfiability (SAT). Several SAT solvers have been developed to solve Boolean decision problems, including ABC [19], MiniSAT [20], and others. Some of them, such as CryptoMiniSAT [21], specifically target XOR-rich circuits and are potentially useful for arithmetic circuit verification, but all are based on a computationally expensive DPLL (Davis, Putnam, Logemann, Loveland) decision procedure [22]. Some techniques combine automatic test pattern generation (ATPG) and modular arithmetic constraint-solving techniques for the purpose of test generation and assertion checking [23]. Others integrate linear arithmetic constraints with Boolean SAT in a unified algebraic domain [24], but their effectiveness is limited by constraint propagation across the Boolean and word-level boundary. To avoid this problem, methods based on ILP models of arithmetic operators have been proposed [25] [26], but in general ILP techniques are known to be computationally expensive and not scalable to large-scale systems. SMT solvers depart from treating the problem in a strictly Boolean domain and integrate different well-defined theories (Boolean logic, bit vectors, integer arithmetic, etc.) into a DPLL-style SAT decision procedure [27]. Some of the most effective SMT solvers, potentially applicable to our problem, are Boolector [28], Z3 [29], and CVC [30]. However, SMT solvers still model functional verification as a decision problem and, as demonstrated by extensive experimental results, neither SAT nor SMT solvers can efficiently solve the verification problem of large arithmetic circuits [1] [10].

Our tool and benchmarks used in this journal paper are released publicly at our project website at https://ycunxi.github.io/Parallel Formal Analysis GaloisField
C. Theorem Provers
Another class of solvers includes Theorem Provers, deductive systems for proving that an implementation satisfies the specification, using mathematical reasoning. The proof system is based on a large and strongly problem-specific database of axioms and inference rules, such as simplification, term rewriting, induction, etc. Some of the most popular theorem proving systems are HOL [31], PVS [32], ACL2 [33], and the term rewriting method described in [34]. These systems are characterized by high abstraction and powerful logic expressiveness, but they are highly interactive and require intimate domain knowledge, extensive user guidance, and expertise for efficient use. The success of verification using theorem provers depends on the set of available axioms and rewrite rules, and on the choice and order in which the rules are applied during the proof process, with no guarantee of a conclusive answer [35].
D. Computer Algebra Approaches
The most advanced techniques that have potential to solve the arithmetic verification problems are those based on Symbolic Computer Algebra. The verification problem is typically formulated as a proof that the implementation satisfies the specification [2] [1] [8] [7] [9]. This is accomplished by performing a series of divisions of the specification polynomial $F$ by a set of implementation polynomials $B$, representing circuit components, a process referred to as reduction of $F$ modulo $B$. Polynomials $f_1, ..., f_s \in B$ are called the bases, or generators, of the ideal $J$. Given a set $f_1, ..., f_s$ of generators of $J$, the set of all simultaneous solutions to the system of equations $f_1 = 0; ...; f_s = 0$ is called a variety $V(J)$. The verification problem is then formulated as a test of whether the specification $F$ vanishes on $V(J)$, i.e., if $F \in V(J)$. This is known in computer algebra as the ideal membership testing problem [1].

The standard procedure to test if $F \in V(J)$ is to divide polynomial $F$ by the polynomials $\{f_1, ..., f_s\}$ of $B$, one by one. The goal is to cancel, at each iteration, the leading term of $F$ using one of the leading terms of $f_1, ..., f_s$. If the remainder $r$ of the division is 0, then $F$ vanishes on $V(J)$, proving that the implementation satisfies the specification. However, if $r \neq 0$, such a conclusion cannot be made; $B$ may not be sufficient to reduce $F$ to 0, and yet the circuit may be correct. To reliably check if $F$ is reducible to zero, a canonical set of generators, $G = \{g_1, ..., g_t\}$, called a Gröbner basis, is needed. It has been shown that for combinational circuits with no feedback, certain conditions automatically make the set $B$ a Gröbner basis [36]. Specifically, if the polynomials $f_1, ..., f_s \in B$ are ordered in reverse topological order of logic gates, from primary outputs to primary inputs, and the leading term of each polynomial is the output of a logic gate, then set $B$ is automatically a Gröbner basis. Some authors use Gaussian elimination, rather than explicit polynomial division, to speed up the polynomial reduction process [1] [8]. The polynomials corresponding to fanout-free logic cones can be precomputed to reduce the size of the problem [8].

The polynomial reduction technique has been successfully applied to both integer arithmetic circuits [9] and Galois field arithmetic [1]. Verification work on Galois field arithmetic has been presented in [1] [5]. Formulation of problems in GF arithmetic takes advantage of known properties of Galois fields during polynomial reductions. Specifically, the problem reduces to ideal membership testing over a larger ideal that includes the ideal $J_0 = \langle x^2 - x \rangle$ in $\mathbb{F}_2$, for each internal signal $x$ of the circuit. Inclusion of this ideal basically assures that each signal assumes a binary value. In this paper, we provide a comparison between this technique and our approach.

E. Function Extraction

Function extraction is an arithmetic verification method originally proposed in [2] for arithmetic circuits in modular integer arithmetic $\mathbb{Z}_{2^m}$. It extracts a unique bit-level polynomial function implemented by the circuit directly from its gate-level implementation. Instead of expensive polynomial division, extraction is done by backward rewriting, i.e., transforming the polynomial representing the encoding of the primary outputs (called the output signature) into a polynomial at the primary inputs (the input signature) using algebraic models of the logic gates of the circuit. That is, the rewriting is performed in reverse topological order. This technique has been successfully applied to large integer arithmetic circuits, such as 512-bit integer multipliers.
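As a minimal illustration of backward rewriting in integer arithmetic, consider a half adder with sum $s = a \oplus b$ and carry $c = a \wedge b$. This circuit is chosen here for brevity and is not an example taken from the paper. Using the integer models $a \oplus b = a + b - 2ab$ and $a \wedge b = ab$, the output signature $2c + s$ is rewritten into the input signature as follows:

```latex
\begin{align*}
F_0 &= 2c + s                      && \text{output signature} \\
F_1 &= 2c + (a + b - 2ab)          && \text{substitute } s = a + b - 2ab \\
F_2 &= 2ab + a + b - 2ab = a + b   && \text{substitute } c = ab \text{; the terms } \pm 2ab \text{ cancel}
\end{align*}
```

The resulting input signature $a + b$ confirms that the circuit adds its two inputs. The cancellation of $2ab$ against $-2ab$ is the kind of internal term cancellation that keeps the expression small during rewriting.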
However, function extraction is not directly applicable to large Galois field multipliers because of the potentially exponential number of polynomial terms before the internal term cancellations take place during rewriting. Fortunately, arithmetic GF($2^m$) circuits offer an inherent parallelism which can be exploited in backward rewriting, without memory explosion.

In the rest of the paper, we first describe how to apply such parallel rewriting in GF($2^m$) circuits while avoiding the memory explosion experienced in integer arithmetic circuits. Using this approach, we extract the function of each output bit in $\mathbb{F}_{2^m}$, and the function is represented as a pseudo-Boolean polynomial expression, where all variables are Boolean. Finally, we propose a method to reverse engineer the GF($2^m$) designs by analyzing these expressions.

III. GALOIS FIELD MULTIPLICATION
Galois field (GF) is a number system with a finite number of elements and two main arithmetic operations, addition and multiplication; other operations such as division can be derived from those two [3]. A Galois field with $p$ elements is denoted GF($p$). The most widely used finite fields are Prime Fields and Extension Fields, and particularly Binary Extension Fields. A prime field, denoted GF($p$), is a finite field consisting of a finite number of integers $\{0, 1, ..., p-1\}$, where $p$ is a prime number, with addition and multiplication performed modulo $p$. A binary extension field, denoted GF($2^m$) (or $\mathbb{F}_{2^m}$), is a finite field with $2^m$ elements. Unlike in prime fields, however, the operations in extension fields are not computed modulo $2^m$. Instead, in one possible representation (called polynomial basis), each element of GF($2^m$) is a polynomial with at most $m$ terms (degree less than $m$) and coefficients in GF(2), taken modulo $P(x)$. Addition of field elements is the usual addition of polynomials, with coefficient arithmetic performed modulo 2. Multiplication of field elements is performed modulo an irreducible polynomial $P(x)$ of degree $m$ with coefficients in GF(2). The irreducible polynomial $P(x)$ is analogous to the prime number $p$ in prime fields GF($p$). In this work, we focus on the verification problem of GF($2^m$) multipliers that appear in many cryptography and some DSP applications.

A. GF Multiplication Principle
Two different GF multiplication structures, constructed using different irreducible polynomials $P_1(x)$ and $P_2(x)$, are shown in Figure 1. The integer multiplication takes two $n$-bit operands as input and generates a $2n$-bit word, where the values computed at lower significant bits ripple through the carry chain all the way to the most significant bit (MSB). In contrast, in GF($2^m$) implementations the number of outputs is reduced to $n$ using the irreducible polynomial $P(x)$. The product terms are added for each column (output bit position) modulo 2, hence there is no carry propagation. For example, to represent the result in GF($2^4$), with only four output bits, the four most significant bits in the result of the integer multiplication have to be reduced to GF($2^4$). The result of such a reduction is shown in Figure 1. In GF($2^4$), the input and output operands are represented using polynomials $A(x)$, $B(x)$ and $Z(x)$, where $A(x) = \sum_{n=0}^{3} a_n x^n$, $B(x) = \sum_{n=0}^{3} b_n x^n$, and $Z(x) = \sum_{n=0}^{3} z_n x^n$, respectively.

Example 1: The function of each multiplication bit $s_i$ ($i \in [0, 6]$) is represented using polynomials in GF(2), namely: $s_0 = a_0 b_0$, $s_1 = a_1 b_0 + a_0 b_1$, etc., up to $s_6 = a_3 b_3$. The output bits $z_n$ ($n \in [0, 3]$) are computed modulo the irreducible polynomial $P(x)$. Using $P_2(x) = x^4 + x + 1$, we obtain: $z_0 = s_0 + s_4$, $z_1 = s_1 + s_4 + s_5$, $z_2 = a_0 b_2 + a_1 b_1 + a_2 b_0 + a_2 b_3 + a_3 b_2 + a_3 b_3$, and $z_3 = a_0 b_3 + a_1 b_2 + a_2 b_1 + a_3 b_0 + a_3 b_3$. The coefficients of the multiplication results are shown in Figure 2. In digital circuits, partial products are implemented using AND gates, and addition modulo 2 is done using XOR gates. Note that, unlike in integer multiplication, in GF($2^m$) circuits there is no carry out to the next bit. For this reason, as we can see in Figure 1, the function of each output bit can be computed independently of other bits.

Footnote: For polynomials in GF(2), "+" is computed modulo 2.

[Figure 1 layout: the sixteen partial products $a_i b_j$ ($i, j \in [0, 3]$) are summed by column into $s_0, ..., s_6$, and $s_4$, $s_5$, $s_6$ are then folded into the four output columns. With $P_1(x)$: $z_3 = s_3 + s_4 + s_5 + s_6$, $z_2 = s_2 + s_6$, $z_1 = s_1 + s_5 + s_6$, $z_0 = s_0 + s_4 + s_5 + s_6$. With $P_2(x)$: $z_3 = s_3 + s_6$, $z_2 = s_2 + s_5 + s_6$, $z_1 = s_1 + s_4 + s_5$, $z_0 = s_0 + s_4$.]

Figure 1: Two GF($2^4$) multiplications constructed using $P_1(x) = x^4 + x^3 + 1$ and $P_2(x) = x^4 + x + 1$.

output    polynomial expression
$z_0$     $(a_0 b_0) + a_1 b_3 + a_2 b_2 + a_3 b_1$
$z_1$     $(a_0 b_1 + a_1 b_0) + a_1 b_3 + a_2 b_2 + a_3 b_1 + a_2 b_3 + a_3 b_2$
$z_2$     $(a_0 b_2 + a_1 b_1 + a_2 b_0) + a_2 b_3 + a_3 b_2 + a_3 b_3$
$z_3$     $(a_0 b_3 + a_1 b_2 + a_2 b_1 + a_3 b_0) + a_3 b_3$

Figure 2: Extracted algebraic expressions of the four output bits of a GF($2^4$) multiplier for $P(x) = x^4 + x + 1$.

B. Irreducible Polynomials
In general, there are various irreducible polynomials that can be used for a given field size, each resulting in a different multiplication result. For constructing efficient arithmetic functions over GF($2^m$), the irreducible polynomial is typically chosen to be a trinomial, $x^m + x^a + 1$, or a pentanomial, $x^m + x^a + x^b + x^c + 1$ [37]. For efficiency reasons, the exponents $m$ and $a$ are chosen such that $m - a \geq m/2$.

An example of constructing GF($2^4$) multiplication using two different irreducible polynomials is shown in Figure 1. We can see that each polynomial produces a unique multiplication result. The size of the corresponding multiplier can be estimated by counting the number of XOR operations in each multiplication. Since the number of AND and XOR operations for generating partial products (variables $s_i$ in Figure 1) is the same, the difference is only caused by the reduction of the corresponding polynomials modulo $P(x)$. The number of two-input XOR operations introduced by the reduction with $P(x)$ can be obtained as the number of terms in each column minus one. For example, the number of XORs using $P_1(x)$ is 3+1+2+3=9; and using $P_2(x)$, the number of XORs is 1+2+2+1=6.

As will be shown in the next section, given the structure of the GF($2^m$) multiplication, such as the one shown in Figure 1, one can readily identify the irreducible polynomial $P(x)$ used during the GF reduction. This can be done by extracting the terms $s_k$ corresponding to the entry $s_m$ (here $s_4$) in the table and generating the irreducible polynomial beyond $x^m$. We know that $P(x)$ must contain $x^m$, and the remaining terms $x^k$ of $P(x)$ are obtained from the non-zero terms corresponding to the entry $s_m$. For example, for the irreducible polynomial $P_1(x) = x^4 + x^3 + x^0$, the terms $x^3$ and $x^0$ are obtained by noticing the placement of $s_4$ in columns $z_3$ and $z_0$. Similarly, for $P_2(x) = x^4 + x^1 + x^0$, the terms $x^1$ and $x^0$ are obtained by noticing that $s_4$ is placed in columns $z_1$ and $z_0$.
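The column counts and the placement of $s_4$ discussed above can be reproduced with a short script. The following Python sketch is illustrative only and is not part of the authors' tool; the function names `reduction_columns`, `xor_count`, and `recover_taps` are hypothetical. It computes $x^k \bmod P(x)$ for $k = 0, ..., 2m-2$, lists which partial sums $s_k$ land in each output column $z_j$, counts the two-input XORs introduced by the reduction, and reads the low-order terms of $P(x)$ back from the columns that contain $s_m$.

```python
def reduction_columns(m, taps):
    """Columns of the reduced product for P(x) = x^m + sum(x^t for t in taps).

    Returns a list cols where cols[j] holds every k such that the partial
    sum s_k contributes to output bit z_j. Polynomials over GF(2) are stored
    as integer bitmasks (bit t set means the term x^t is present).
    """
    low = sum(1 << t for t in taps)      # P(x) without its x^m term
    residues = []                        # residues[k] = x^k mod P(x)
    r = 1
    for _ in range(2 * m - 1):
        residues.append(r)
        r <<= 1                          # multiply by x
        if (r >> m) & 1:                 # degree reached m: use x^m = low
            r = (r ^ (1 << m)) ^ low
    return [[k for k, res in enumerate(residues) if (res >> j) & 1]
            for j in range(m)]


def xor_count(cols):
    """A column with t terms needs t - 1 two-input XORs."""
    return sum(len(c) - 1 for c in cols)


def recover_taps(cols, m):
    """Low-order exponents of P(x): the columns in which s_m appears."""
    return [j for j, c in enumerate(cols) if m in c]


cols_p1 = reduction_columns(4, [3, 0])   # P1(x) = x^4 + x^3 + 1
cols_p2 = reduction_columns(4, [1, 0])   # P2(x) = x^4 + x + 1
print(xor_count(cols_p1), recover_taps(cols_p1, 4))
print(xor_count(cols_p2), recover_taps(cols_p2, 4))
```

Storing each residue as a bitmask keeps the sketch short; for the large field sizes considered later in the paper, Python integers still work because they have arbitrary precision.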
The reason for this and the details of the procedure are explained in the next section.

IV. PARALLEL EXTRACTION IN GALOIS FIELD
In this section, we introduce our method for extracting the unique algebraic expressions of the output bits (e.g., Figure 2) using a computer algebra method. This can be used to verify the GF($2^m$) multipliers when the binary encoding of inputs and outputs and the irreducible polynomial are given. We introduce a parallel function extraction framework in GF($2^m$), which allows us to individually extract the algebraic expression of each output bit. This framework is used for reverse engineering, since our reverse engineering approach is based on analyzing the algebraic expression of output bits in GF(2), as introduced in Section I.

A. Computer Algebraic Model

The circuit is modeled as a network of logic elements of arbitrary complexity, including basic logic gates (AND, OR, XOR, INV) and complex standard cell gates (AOI, OAI, etc.) generated by logic synthesis and technology mapping. We extend the algebraic model of Boolean operators developed in [10] for integer arithmetic to finite field arithmetic in GF(2), i.e., modulo 2. For example, the pseudo-Boolean model XOR($a, b$) $= a + b - 2ab$ is reduced to $(a + b + 2ab) \bmod 2 = (a + b) \bmod 2$. The following algebraic equations are used to describe basic logic gates in GF($2^m$) [1]:

$$
\begin{aligned}
\neg a &= 1 + a \\
a \wedge b &= a \cdot b \\
a \vee b &= a + b + a \cdot b \\
a \oplus b &= a + b
\end{aligned}
\qquad (1)
$$

B. Outline of the Approach
Similarly to the work of [2] and [10], the arithmetic function computed by the circuit is obtained by transforming (rewriting) the polynomial representing the encoding of the primary outputs (called the output signature) into the polynomial at the primary inputs, the input signature. The output signature of a GF($2^m$) multiplier is $Sig_{out} = \sum_{i=0}^{m-1} z_i x^i$, with $z_i \in$ GF(2). The input signature of a GF($2^m$) multiplier is $Sig_{in} = \sum_{i=0}^{m-1} P_i x^i$, with coefficients $P_i \in$ GF(2) being product terms, and the addition operation performed modulo 2. If the irreducible polynomial $P(x)$ is provided, $Sig_{in}$ is known; otherwise, it will be computed by backward rewriting from $Sig_{out}$. The goal is to transform the output signature, $Sig_{out}$, using the polynomial representation of the internal logic elements (1), into an input signature $Sig_{in}$ in GF($2^m$), which determines the arithmetic function (specification) computed by the circuit.

Theorem 1: Given a combinational arithmetic circuit in GF($2^m$), composed of logic gates described by Eq. (1), the input signature $Sig_{in}$ computed by backward rewriting is unique and correctly represents the function implemented by the circuit in GF($2^m$).

Proof:
The proof of correctness relies on the fact that each transformation step (rewriting iteration) is correct. That is, each internal signal is represented by an algebraic expression, which always evaluates to a correct value in GF($2^m$). This is guaranteed by the correctness of the algebraic model in Eq. (1), which can be proved easily by inspection. For example, the algebraic expression of XOR($a, b$) in $\mathbb{Z}_{2^m}$ is $a + b - 2ab$. When implemented in GF($2^m$), the coefficients in the expression must be in GF(2), hence XOR($a, b$) in $\mathbb{F}_{2^m}$ is represented by $a + b$. The proof of uniqueness is done by induction on $i$, the step of transforming polynomial $F_i$ into $F_{i+1}$. A detailed induction proof of this theorem is provided in [2] for expressions in $\mathbb{Z}_{2^m}$. □

Algorithm 1 Backward Rewriting in GF($2^m$)
Input: Gate-level netlist of GF($2^m$) multiplier
Input: Output signature $Sig_{out}$, and (optionally) input signature, $Sig_{in}$
Output: GF function of the design; return $Sig_{out} == Sig_{in}$
 1: $P = \{p_0, p_1, ..., p_n\}$: polynomials representing gate-level netlist
 2: $F_0 = Sig_{out}$
 3: for each polynomial $p_i \in P$ do
 4:     for output variable $v$ of $p_i$ in $F_i$ do
 5:         replace every variable $v$ in $F_i$ by algebraic expression of $p_i$
 6:         $F_i \to F_{i+1}$
 7:         for each monomial $M$ in $F_{i+1}$ do
 8:             if the coefficient of $M$ % 2 == 0 or $M$ is a constant, $M$ % 2 == 0 then
 9:                 remove $M$ from $F_{i+1}$
10:             end if
11:         end for
12:     end for
13: end for
14: return $F_n$ and $F_n$ =? $Sig_{in}$

Theorem 1, together with the algebraic model of Boolean gates (1), provides the basis for polynomial reduction using backward rewriting. This is described by Algorithm 1. The method takes the gate-level netlist of a GF($2^m$) multiplier as input and first converts each logic gate into an algebraic expression using Eq. (1). The rewriting process starts with the output signature $F_0 = Sig_{out}$ and performs rewriting in reverse topological order, from outputs to inputs. It ends when all the variables in $F_i$ are primary inputs, at which point it becomes the input signature $Sig_{in}$ [2].

Each iteration includes two basic steps: 1) substitute the variable of the gate output using the expression in the inputs of the gate (Eq. 1), and name the new expression $F_{i+1}$ (lines 3 - 6); and 2) simplify the new expression in two ways: a) by eliminating terms that cancel each other (as in the integer arithmetic case [2]), and b) by removing all the monomials (including constants) that reduce to 0 in GF(2) (line 3 and lines 7 - 10). The algorithm outputs the arithmetic function of the design in GF($2^m$) after $n$ iterations, where $n$ is the number of gates in the netlist. The final expression $F_n = Sig_{in}$ can be used to verify if the circuit performs the desired arithmetic function by checking if the computed polynomial $Sig_{in}$ matches the expected specification, if known. This equivalence check can be readily performed using canonical word-level representations, such as BMD [15] or TED [17], which can efficiently check equivalence of two polynomials. Alternatively, if the specification is not known, the computed signature can serve as the specification extracted from the circuit.

[Figure 3 schematic: primary inputs $a_0$, $a_1$, $b_0$, $b_1$; internal signals $i_1$ to $i_6$; gates numbered 1 to 8; outputs $z_0$ and $z_1$; $Sig_{in}$ at the inputs and $Sig_{out}$ at the outputs.]

Figure 3: The gate-level netlist of a post-synthesized and mapped 2-bit multiplier over GF($2^2$). The irreducible polynomial is $P(x) = x^2 + x + 1$.

[Figure 4 table: one row per rewriting step, from $F_{init} = z_0 + x z_1$ through gates G8, G7, ..., G1 to $Sig_{in} = a_0 b_0 + a_1 b_1 + x(a_0 b_1 + a_1 b_0 + a_1 b_1)$; the "Eliminating terms" column lists $2x$ at G4, $2x$ at G2, and $2$ at G1.]

Figure 4: Function extraction of the 2-bit GF multiplier shown in Figure 3 using backward rewriting from PO to PI.

Example 2 (Figure 3): We illustrate our method using a post-synthesized 2-bit multiplier in GF($2^2$), shown in Figure 3. The irreducible polynomial is $P(x) = x^2 + x + 1$. The output signature is $Sig_{out} = z_0 + z_1 x$, and the input signature is $Sig_{in} = (a_0 b_0 + a_1 b_1) + (a_1 b_1 + a_1 b_0 + a_0 b_1) x$. First, $F_{init} = Sig_{out}$ is transformed into $F_1$ using the polynomial of gate $g_8$, which replaces $z_1$ by the sum of the two internal signals that drive it, so that $F_1$ consists of $z_0$ plus the two corresponding monomials in $x$. Then, the polynomials $F_{i+1}$ are successively derived from $F_i$ and checked for a possible reduction. The first reduction happens when $F_4$ is transformed into $F_5$, where an internal signal (at gate $g_4$) is replaced by its expression in the primary inputs. After simplification, a monomial $2x$ is identified and removed modulo 2 from $F_5$. Similar reductions are applied during the transformations $F_6 \to F_7$ and $F_7 \to F_8$. Finally, the function of the design is extracted as expression $F_8$. A complete rewriting process is shown in Figure 4.
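The rewriting loop of Algorithm 1 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' C++ implementation: the helper names `gate_poly`, `substitute`, and `backward_rewrite` are hypothetical, the netlist below is a hand-written stand-in for a 2-bit multiplier with $P(x) = x^2 + x + 1$ (it is not the netlist of Figure 3), and the NAND and XNOR models are derived from Eq. (1) by composing the inverter with AND and XOR. A polynomial is stored as a dictionary from monomials (frozensets of variable names) to integer coefficients, and the field indeterminate $x$ is carried along as an ordinary symbol.

```python
ONE = frozenset()  # the empty monomial, i.e. the constant 1


def gate_poly(kind, a, b):
    """GF(2) model of a two-input gate, following Eq. (1)."""
    A, B, AB = frozenset([a]), frozenset([b]), frozenset([a, b])
    models = {
        "AND": {AB: 1},
        "XOR": {A: 1, B: 1},
        "OR": {A: 1, B: 1, AB: 1},
        "NAND": {ONE: 1, AB: 1},        # 1 + ab
        "XNOR": {ONE: 1, A: 1, B: 1},   # 1 + a + b
    }
    return models[kind]


def substitute(F, var, expr):
    """Replace var in F by expr, then keep only odd-coefficient monomials."""
    out = {}
    for mono, c in F.items():
        if var not in mono:
            out[mono] = out.get(mono, 0) + c
            continue
        rest = mono - {var}
        for m2, c2 in expr.items():
            key = rest | m2             # Boolean variables are idempotent
            out[key] = out.get(key, 0) + c * c2
    # Modulo-2 simplification: even coefficients vanish, odd ones become 1.
    return {mono: 1 for mono, c in out.items() if c % 2 == 1}


def backward_rewrite(netlist, sig_out):
    """Rewrite sig_out gate by gate in reverse topological order."""
    F = dict(sig_out)
    for out, kind, a, b in reversed(netlist):
        F = substitute(F, out, gate_poly(kind, a, b))
    return F


# Hypothetical netlist (output, gate, in1, in2), listed in topological order.
NETLIST = [
    ("i1", "AND", "a0", "b0"),
    ("i2", "AND", "a1", "b1"),
    ("i3", "AND", "a0", "b1"),
    ("i4", "AND", "a1", "b0"),
    ("i5", "XOR", "i3", "i4"),
    ("i6", "NAND", "a1", "b1"),
    ("z0", "XOR", "i1", "i2"),
    ("z1", "XNOR", "i5", "i6"),
]

sig_out = {frozenset(["z0"]): 1, frozenset(["x", "z1"]): 1}  # z0 + x*z1
sig_in = backward_rewrite(NETLIST, sig_out)
print(sorted("*".join(sorted(m)) for m in sig_in))
```

In this stand-in netlist, the monomial $x$ picks up coefficient 2 when `i6` is substituted and is then dropped, which mirrors the $2x$ eliminations recorded in Figure 4. Keeping monomials as frozensets makes the merge of equal monomials a plain dictionary update.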
Figure 4 shows that $F_8 = Sig_{in}$, which indicates that the circuit indeed implements the GF($2^2$) multiplication with $P(x) = x^2 + x + 1$.

An important observation is that the potential reductions take place only within the expression associated with the same degree of polynomial ring ($Sig_{out}$). In other words, the reductions happen in a logic cone of every output bit independently of other bits, regardless of logic sharing between the cones. For example, the reductions in $F_5$ and $F_7$ happen within the logic cone of output $z_1$ only. Similarly, in $F_8$, the reduction is within the logic cone of $z_0$. Details of the proof are provided in [13].

C. Implementation
This section describes the implementation of our parallel verification method for Galois field multipliers. Our approach takes the gate-level netlist as input, and outputs the extracted function of the design. It includes four steps:

Step 1: Convert netlist to equations. Parse the gate-level netlist into algebraic equations based on Equation 1. The equations are listed in topological order, to be rewritten by backward rewriting in the next step. For a GF($2^m$) multiplier, $m$ copies of this equation file are made.

Step 2: Generate signatures. Split the output signature of GF($2^m$) multipliers into $m$ polynomials, with $Sig_{out_i} = z_i$. Insert the new signatures into the $m$ copies of the equation file generated in Step 1. Each signature represents a single output bit.

Step 3: Parallel extraction. Apply Algorithm 1 to each equation file to extract the polynomial expression of each output in parallel. In contrast to work on integer arithmetic [2], the internal expression of each output bit does not offer any polynomial reduction (monomial cancellations) with other bits. Ideally, our approach can extract a GF($2^m$) multiplier in $m$ threads. However, due to limited computing resources, it is impractical to extract GF($2^m$) multipliers in $m$ threads when $m$ is very large. Hence, our approach puts a limit on the number of parallel threads $T$ ($T$ = 5, 10, 20 and 30 have been tested in this work). This process is illustrated in Figure 5. The $m$ extraction tasks are organized into several task sets, ordered from LSB to MSB. In each set, the extractions are performed in parallel. Since the runtime of each extraction within the set can differ, the tasks in the next set start as soon as any previous task terminates.

Step 4: Finalization.
Compute the final function of the multiplier. Once the algebraic expression of each output bit in GF($2^m$) is computed, our method computes the final function by constructing $Sig_{out}$ from the results of the rewriting process in Step 3.

Our algorithm uses a data structure that efficiently implements iterative substitution and elimination during backward rewriting. It is similar to the data structure employed in function extraction for integer arithmetic circuits [2], suitably modified to support simplifications in finite field algebra. Specifically, in addition to cancellation of terms with opposite signs, it performs modulo-2 reduction of monomials and constants. The data structure maintains a record of the terms (monomials) in the expression that contain the variable to be substituted, which reduces the cost of finding the terms whose coefficients change during substitution. Each element represents one monomial, consisting of the variables in the monomial and its coefficient. The expression data structure is a C++ object that represents a pseudo-Boolean expression and contains all the elements in the data structure. It supports both fast addition and fast substitution with two C++ maps, implemented as binary search trees: a terms map and a substitution map. The data structure applies two kinds of simplification: 1) after substitution, the coefficients of all the monomials are updated and the monomials with coefficient zero are eliminated; 2) the monomials whose coefficients evaluate to 0 modulo 2 are eliminated. The second case is applied after each substitution.

[Figure 5: Step 3: parallel extraction of a GF($2^m$) multiplier with $T$ threads. Each of the $m$ copies of the netlist equations is paired with one signature, from $Sig_{out} = z_0$ to $Sig_{out} = z_{m-1}$; the tasks are grouped into sets $z_0 \ldots z_{T-1}$, $z_T \ldots z_{2T-1}$, ..., $z_{m-1}$.]

[Figure 6: Extracting the algebraic expressions of $z_0$ and $z_1$ in Fig. 4. The two rewriting traces run over gates G8 down to G1, with eliminations marked, and end with $z_0 = a_0b_0 + a_1b_1$ and $z_1 = x(a_0b_1 + a_1b_0 + a_1b_1)$.]

Example 3 (Figure 6): We illustrate our parallel extraction method using the 2-bit multiplier in GF($2^2$) of Figure 3. The output signature $Sig_{out} = z_0 + z_1 x$ is split into two signatures, $Sig_{out_0} = z_0$ and $Sig_{out_1} = z_1$. Then, the rewriting process is applied to $Sig_{out_0}$ and $Sig_{out_1}$ in parallel. When both have been successfully extracted, the two signatures are merged as $Sig_{out_0} + x \cdot Sig_{out_1}$, resulting in the polynomial $Sig_{in}$. In Figure 4, we can see that elimination happens three times (at three of the $F_i$ rewriting steps). As expected, each elimination happens within a single element of GF($2^n$). In Figure 6, one elimination in $Sig_{out_0}$ and two eliminations in $Sig_{out_1}$ are performed independently, as shown earlier (refer to Example 2).

V. REVERSE ENGINEERING
In this section, we present our approach to reverse engineering of GF($2^m$) multipliers. Using the extraction technique presented in the previous section, we can extract the algebraic expression of each output bit. In contrast to the algebraic techniques of [6], [10], our extraction technique can extract the algebraic expression of each output bit independently. This means that the extraction can be done without knowledge of the bit positions of the inputs and outputs. Two theorems are provided and proved to support this claim.

In a GF($2^m$) multiplication, let $s_i$ ($i \in \{0, 1, \ldots, 2m-2\}$) be a set of partial products generated by AND gates and combined with XOR operations. For example, in Figure 1, there are seven product sets, $s_0, s_1, \ldots, s_6$, where $s_1 = a_0b_1 + a_1b_0$; or, written as a set, $s_1 = \{a_0b_1, a_1b_0\}$, etc. These product sets are divided into two groups: those with index $i \le m-1$, called in-field product sets; and those with index $i \ge m$, called out-of-field product sets. The in-field product sets $s_i$, in this case $s_0, s_1, s_2, s_3$, correspond to the output bits $z_i$. The out-of-field product sets are reduced into the field GF($2^m$) using the mod $P(x)$ operation and assigned to the respective output bits, as determined by $P(x)$. In the case of Figure 1, the out-of-field sets are $s_4, s_5, s_6$. In general, for a GF($2^m$) multiplication, $m$ product sets are in-field and $m-1$ product sets are out-of-field [38].

A. Output Encoding Determination
We will now demonstrate how to determine the encoding, and hence the bit position, of the outputs.

Theorem 2: Given a GF($2^m$) multiplication, the in-field product sets ($s_0, s_1, \ldots, s_{m-1}$) appear in exactly one element of GF($2^m$) each, and the out-of-field product sets ($s_m, s_{m+1}, \ldots, s_{2m-2}$) appear in at least two elements (outputs) of GF($2^m$), as a result of reduction mod $P(x)$.

Proof:
An irreducible polynomial in GF($2^m$) has the standard form $P(x) = x^m + P'(x)$, where the tail polynomial $P'(x)$ contains at least two monomials $x^d$ with degree $d < m$. For example, there are two such monomials for a trinomial, four for a pentanomial, etc. Since $P(x) = 0$, we have $x^m = P'(x)$ in GF($2^m$). Hence the variable $x^m$, associated with the first out-of-field partial product set $s_m$, will appear in at least two outputs, determined by $P'(x)$. Other variables $x^k$, associated with the out-of-field partial product sets $s_k$ for $k > m$, can be expressed as $x^k = x^{k-m} x^m = x^{k-m} P'(x)$ and will contain at least two elements. QED

In fact, the number of outputs in which the out-of-field set $s_k$ appears is equal to the number of monomials in the above product $x^{k-m} P'(x)$, provided that every monomial $x^j$ with $j \ge m$ is recursively reduced mod $P(x)$, i.e., by using the relation $x^m = P'(x)$. We illustrate this fact with an example of multiplication in GF($2^4$) using the irreducible polynomial $P(x) = x^4 + x^3 + 1$, shown on the left side of Figure 1. The in-field sets, associated with outputs $z_0, z_1, z_2, z_3$, are $s_0, s_1, s_2, s_3$. Since $P(x) = x^4 + x^3 + 1 = 0$, we obtain $x^4 = x^3 + 1$. This means that set $s_4$ appears in two output columns, $z_3$ and $z_0$. Then $x^5 = x \cdot x^4 = x(x^3 + 1) = x^4 + x = x^3 + x + 1$, which means that $s_5$ appears in three outputs: $z_3$, $z_1$, $z_0$. Finally, $x^6 = x \cdot x^5 = x(x^3 + x + 1) = x^4 + x^2 + x = x^3 + x^2 + x + 1$, that is, $s_6$ appears in four outputs: $z_3$, $z_2$, $z_1$, $z_0$. As expected, this matches the left table in Figure 1. Note the recursive derivation of $x^k$ for $k > m$, which increases the number of columns to which a given set $s_k$ is assigned.

Based on Theorem 2, we can find the in-field product sets, $s_0, s_1, \ldots, s_{m-1}$, by searching for the unique products in the resulting algebraic expressions of the output bits.
In this context, unique products are the products that exist in only one of the extracted algebraic expressions. Since the in-field product set indicates the bit position of the output, we can determine the bit positions of the output bits as soon as all the in-field product sets are identified.

Example 4 (Figure 2): We illustrate the procedure of determining bit positions with an example of a GF($2^4$) multiplier implemented using the irreducible polynomial $P(x) = x^4 + x^3 + 1$ (see Figure 1). The extracted algebraic expressions of the four output bits are shown in Figure 2. Note that in this process the labels of the variables do not offer any knowledge of the bit positions of the inputs and outputs, nor any binary encoding information. We first identify the unique products: set $s_0 = a_0b_0$ in the algebraic expression of $z_0$; set $s_1 = (a_0b_1 + a_1b_0)$ in $z_1$; set $s_2 = (a_0b_2 + a_1b_1 + a_2b_0)$ in $z_2$; and set $s_3 = (a_0b_3 + a_1b_2 + a_2b_1 + a_3b_0)$ in $z_3$. Note that the number of products in the in-field product set $s_i$ is $i+1$. Hence, we find all the in-field product sets and their relation to the extracted algebraic expressions to be as follows:

$s_0 = a_0b_0$, $z_0$ → least significant bit (LSB)
$s_1 = a_0b_1 + a_1b_0$, $z_1$ → 2nd output bit
$s_2 = a_0b_2 + a_1b_1 + a_2b_0$, $z_2$ → 3rd output bit
$s_3 = a_0b_3 + a_1b_2 + a_2b_1 + a_3b_0$, $z_3$ → most significant bit (MSB)
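The reduction used in the proof of Theorem 2 and the unique-product search of Example 4 can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the authors' C++ tool: the function names (`reduce_power`, `output_expressions`, `output_positions`) and the net names `n3`, `n5`, `n7`, `n9` are hypothetical, products are identified by index pairs $(j, k)$ standing for $a_j b_k$, and the output expressions are generated from a known $P(x)$ rather than extracted from a netlist.

```python
from collections import Counter
from itertools import product


def reduce_power(k, m, tail):
    """Exponents of x^k mod P(x) over GF(2), with P(x) = x^m + sum(x^d for d in tail)."""
    exps = {k}
    while max(exps) >= m:
        top = max(exps)
        exps.remove(top)
        # x^top = x^(top-m) * P'(x); symmetric difference cancels pairs (mod 2)
        exps ^= {top - m + d for d in tail}
    return exps


def output_expressions(m, tail):
    """For each output z_i, the set of products (j, k) = a_j*b_k that feed it."""
    outs = [set() for _ in range(m)]
    for j, k in product(range(m), repeat=2):
        for i in reduce_power(j + k, m, tail):
            outs[i].add((j, k))
    return outs


def output_positions(exprs):
    """Bit position of each unlabelled output, from its unique (in-field) products."""
    counts = Counter(p for e in exprs.values() for p in e)
    positions = {}
    for name, e in exprs.items():
        unique = {p for p in e if counts[p] == 1}   # products seen in no other output
        positions[name] = len(unique) - 1           # the in-field set s_i has i + 1 products
    return positions


outs = output_expressions(4, {3, 0})                # P(x) = x^4 + x^3 + 1
scrambled = {"n7": outs[2], "n3": outs[0], "n9": outs[3], "n5": outs[1]}
print(output_positions(scrambled))                  # {'n7': 2, 'n3': 0, 'n9': 3, 'n5': 1}
```

Here `reduce_power(6, 4, {3, 0})` returns `{0, 1, 2, 3}`, which corresponds to $s_6$ appearing in all four outputs. The position lookup relies only on how many products are unique to an output, not on the output's name.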
B. Input Encoding Determination

Algorithm 2 Input encoding determination for GF($2^m$)
Input: a set of algebraic expressions representing the in-field product sets $S$
Output: bit positions of the input variables
1: $S = \{s_0, s_1, \ldots, s_{m-1}\}$
2: initialize a vector of variables $V \leftarrow \{\}$
3: for $i = 0$; $i \le m-1$; $i$++ do
4:   for each variable $v$ in the algebraic expression of $s_i$ do
5:     if $v$ does not exist in $V$ then
6:       assign bit position value of $v = i$
7:       store $v$ in variable set $V$
8:     end if
9:   end for
10: end for
11: return $V$

We can now determine the bit position of the input variables using the procedure outlined in Algorithm 2. The input bit position can be determined by analyzing the in-field product sets obtained in the previous step. Based on the GF multiplication algorithm, we know that $s_0$ is generated by an AND function of the two LSBs of the two inputs; the two products in $s_1$ are generated by AND and XOR operations using the two LSBs and the two 2nd input bits, etc. For example, in a GF($2^4$) multiplication (Figure 1), $s_0 = a_0b_0$, where $a_0$ and $b_0$ are LSBs; $s_1 = a_0b_1 + a_1b_0$, where $a_0$, $b_0$ are LSBs and $a_1$, $b_1$ are 2nd LSBs. This allows us to determine the bit position of the input bits recursively by analyzing the algebraic expression of $s_i$. We illustrate this with the GF($2^4$) multiplier implemented using $P(x) = x^4 + x^3 + 1$ (Figure 2).

Example 5 (Algorithm 2):
The input of our algorithm is a set of algebraic expressions of the in-field product sets, $s_0, s_1, s_2, s_3$ (line 1). We initialize vector $V$ to store the variables whose bit positions have been assigned (line 2). The first algebraic expression is $s_0$. Since its two variables, $a_0$ and $b_0$, are not in $V$, the bit positions of these two variables are assigned index $i = 0$ (lines 4-8). In the second iteration, $V = \{a_0, b_0\}$, and the input algebraic expression is $s_1$, which includes variables $a_0$, $b_0$, $a_1$, and $b_1$. Because $a_1$ and $b_1$ are not in $V$, their bit position is $i = 1$. The loop ends when all the algebraic expressions in $S$ have been visited, and returns $V = \{(a_0, b_0)_0, (a_1, b_1)_1, (a_2, b_2)_2, (a_3, b_3)_3\}$. The subscripts outside the parentheses are the bit position values of the variables returned by the algorithm. Note that this procedure only gives the bit position of the input bits; how the input words are constructed remains unknown. There are $2^{m-1}$ combinations from which the words can be constructed using the information returned in $V$. For example, the two input words can be $W_0 = a_0 + 2a_1 + 4b_2 + 8a_3$ and $W_1 = b_0 + 2b_1 + 4a_2 + 8b_3$; or they can be $W'_0 = a_0 + 2a_1 + 4b_2 + 8b_3$ and $W'_1 = b_0 + 2b_1 + 4a_2 + 8a_3$. Although there may be many combinations for constructing the input words, the specification of the GF($2^m$) multiplier is unique.

C. Extraction of the Irreducible Polynomial
Theorem 3: Given a multiplication in GF($2^m$), let the first out-of-field product set be $s_m$. Then, the irreducible polynomial $P(x)$ includes monomials $x^m$ and $\{x^i\}$ iff all products in the set $s_m$ appear in the algebraic expression of the $i$-th output bit, for all $i < m$.

Proof: Based on the definition of field arithmetic for GF($2^m$), the polynomial basis representation of $s_m$ is $x^m s_m$. To reduce $s_m$ into elements in the range $[0, m-1]$, the field reductions are performed modulo the irreducible polynomial $P(x)$ with highest degree $m$ (cf. the proof of Theorem 2). As before, let $P(x) = x^m + P'(x)$. Then,

$x^m s_m \bmod (x^m + P'(x)) = s_m P'(x)$

Hence, if $x^i$ exists in $P'(x)$, it also exists in $P(x)$. Therefore, $x^i$ exists in $P(x)$ iff $x^i s_m$ exists in $x^m s_m \bmod P(x)$. QED

Even though the input bit positions have been determined in the previous step, we cannot directly generate $s_m$, since the combination of the input bits for constructing the input words is still unknown. In Example 5 ($m$ = 4), we can see that $s_m = \{a_1b_3, a_2b_2, a_3b_1\}$ when the input words are $W_0$ and $W_1$; but $s_m = \{a_1a_3, a_2b_2, b_1b_3\}$ when the input words are $W'_0$ and $W'_1$. To overcome this limitation, we create a set of products $s'_m$, which includes all the possible products that can be generated based on all input combinations. The set $s'_m$ includes the true products, i.e., those that exist in the first out-of-field product set; it also includes some dummy products. The dummy products are those that never appear in the resulting algebraic expressions. Hence, we first generate the set $s'_m$ and eliminate the dummy products by searching the algebraic expressions. After this, we obtain $s_m$. Then, we use $s_m$ to extract the irreducible polynomial $P(x)$ using Algorithm 3.

Example 6:
We illustrate the method of reverse engineering the irreducible polynomial using the GF($2^4$) multiplier of Fig. 1. The procedure is outlined in Algorithm 3. The extracted algebraic expressions $S$ (line 1 of Algorithm 3) are shown in Figure 2. The bit positions of the input bits are determined by Algorithm 2 (line 2). Based on the result of Algorithm 2, we generate $s'_m = \{a_1a_3, b_1b_3, a_2b_2, a_1b_3, a_3b_1\}$. To eliminate the dummy products from $s'_m$, we search all algebraic expressions in $S$ and eliminate the products that are not part of the resulting products. In this case, we find that $a_1a_3$ and $b_1b_3$ are the dummy products. Hence, we get $s_m = \{a_2b_2, a_1b_3, a_3b_1\}$ (line 3). Based on the definition of an irreducible polynomial, $P(x)$ must include $x^m$; in this example $m = 4$ (line 4). While looping over all the algebraic expressions, we find that the expressions for $z_3$ and $z_0$ contain all the products of $s_m$. Hence, $x^3$ and $x^0$ are included in $P(x)$, so that $P(x) = x^4 + x^3 + x^0$. We can see that it is the same as $P(x)$ in Figure 1.

Algorithm 3 Extracting irreducible polynomial in GF($2^m$)
Input: the algebraic expressions of output bits $S$
Input: the first out-of-field product set $s_m$
Output: irreducible polynomial $P(x)$
1: $S = \{exp_0, exp_1, \ldots, exp_{m-1}\}$
2: $V \leftarrow$ Algorithm 2($S$)
3: $s_m \leftarrow$ eliminate_dummy($s'_m \leftarrow V$, $S$)
4: $P(x) = x^m$: initialize irreducible polynomial
5: for $i = 0$; $i \le m-1$; $i$++ do
6:   if all products in $s_m$ exist in $exp_i$ then
7:     $P(x)$ += $x^i$
8:   end if
9: end for
10: return $P(x)$

In summary, using the framework presented in Section IV-C, we first extract the algebraic expressions of all output bits. Then, we analyze the algebraic expressions to find the bit positions of the input bits and the output bits, and extract the irreducible polynomial $P(x)$. In the example of the GF($2^4$) multiplier implemented using $P(x) = x^4 + x^3 + 1$, shown in Figure 1, our approach returns the following: 1) the input bit set $V = \{(a_0, b_0)_0, (a_1, b_1)_1, (a_2, b_2)_2, (a_3, b_3)_3\}$, where the subscripts outside the parentheses represent the bit position; 2) $z_0$ is the least significant bit (LSB), $z_1$ is the 2nd output bit, $z_2$ is the 3rd output bit, and $z_3$ is the most significant bit (MSB); 3) the irreducible polynomial is $P(x) = x^4 + x^3 + 1$; 4) the specification can be verified using the approach presented in Section IV with the reverse engineered information.

VI. RESULTS
The experimental results of our method are presented in two subsections: 1) evaluation of parallel verification of GF($2^m$) multipliers; and 2) evaluation of reverse engineering of GF($2^m$) multipliers. The results given in this section include data (total time and maximum memory) for the entire verification or reverse engineering process, including translating the gate-level Verilog netlist to algebraic equations, performing backward rewriting, and other required functions.

A. Parallel Verification of GF($2^m$) Multipliers

The verification technique for GF($2^m$) multipliers presented in Section IV was implemented in C++. It performs backward rewriting with variable substitution and polynomial reductions in Galois field in parallel fashion. The program was tested on a number of combinational gate-level GF($2^m$) multipliers taken from [6], including the Montgomery multipliers [39] and Mastrovito multipliers [40]. The bit-width of the multipliers varies from 32 to 571 bits. The verification results for various Galois field multipliers obtained using SAT, SMT, ABC [41], and Singular [42] have already been presented in [1] and [6]. They clearly demonstrate that techniques based on computer algebra perform significantly better than other known techniques. Hence, in this work, we only compare our approach to those two, and specifically to the tool described in [6]. However, in contrast to the previous work on Galois field verification, all the GF($2^m$) multipliers used in this paper are bit-blasted gate-level implementations. The bit-level multipliers are taken from [6] and mapped onto gate-level circuits using ABC [41]. Our experiments were conducted on a PC with an Intel(R) Xeon CPU E5-2420 v2 2.20 GHz × 12 with 32 GB memory. As described in Section IV, our technique can verify Galois field multipliers in multiple threads by applying Algorithm 1 to each output bit in parallel. The number of threads is given as input to the tool.

TABLE I: Results of verifying Mastrovito multipliers using our parallel approach. T is the number of threads. MO = memory out of 32 GB. TO = time out of 18 hours. (*T=1 shows the maximum memory usage of a single thread.)

Mastrovito            [5]                This work: runtime (s)                 Mem
Op size  #Gates       Runtime  Mem       T=1     T=5     T=10    T=20    T=30    T=1*
32       5,482        1        3         5       2       1       1       1       10 MB
48       12,228       8        13        9       6       3       3       2       21 MB
64       21,814       29       21        19      11      8       7       7       37 MB
96       51,412       195      45        68      38      26      20      23      84 MB
128      93,996       924      91        153     91      63      55      57      152 MB
163      153,245      3546     161       336     192     137     121     113     248 MB
233      167,803      4933     168       499     294     212     180     171     270 MB
283      399,688      30358    380       1580    890     606     550     530     642 MB
571      1,628,170    TO       -         13176   7980    5038    MO      MO

TABLE II: Results of verifying Montgomery multipliers using our parallel approach. T is the number of threads. TO = time out of 18 hours. (*T=1 shows the maximum memory usage of a single thread.)

Montgomery            [5]                This work: runtime (s)                 Mem
Op size  #Gates       Runtime  Mem       T=1     T=5     T=10    T=20    T=30    T=1*
32       4,352        2        3         5       3       2       1       2       8 MB
48       9,602        14       13        34      18      11      9       6       16 MB
64       16,898       63       21        80      45      31      28      27      27 MB
96       37,634       554      45        414     234     157     133     142     59 MB
128      66,562       1924     68        335     209     121     115     110     95 MB
163      107,582      12063    101       2505    1616    1172    1095    1008    161 MB
233      219,022      TO       168       1240    722     565     457     480     301 MB
283      322,622      TO       380       32180   19745   17640   15300   14820   488 MB

The experimental results of our approach and comparison with [6] are shown in Table I for gate-level Mastrovito multipliers with bit-width varying from 32 to 571 bits. These multipliers are directly mapped using ABC without any optimization. The largest circuit includes over 1.6 million gates. This is also the number of polynomial equations and the number of rewriting iterations (see Section IV). The results generated by the tool presented in [6] are shown in columns 3 and 4 of Table I. We performed four different series of experiments, with the number of threads $T$ varying from 5 to 30. The table shows CPU runtime and memory usage for different values of $T$. The timeout limit (TO) was set to 12 hours and the memory limit (MO) to 32 GB. The experimental results show that our approach provides on average 26.2×, 37.8×, 42.7×, and 44.3× speedup for $T$ =
5, 10, 20, and 30 threads, respectively. Our approach can verify Mastrovito multipliers up to 571 bits wide in 1.5 hours, while that of [6] fails after 12 hours.

The reported memory usage of our approach is the maximum memory usage per thread. This means that our tool reaches its peak memory usage when all $T$ threads are running in the process; in this case, the memory usage is $T \cdot Mem$. This is why the 571-bit Mastrovito multiplier could be successfully verified with $T$ = 5 and 10, but failed with $T$ = 20 and 30 threads. For example, the peak memory usage of the 571-bit Mastrovito multiplier with $T$ = 20 is $2.6 \times 20 = 52$ GB, which exceeds the available memory limit.

We also tested Montgomery multipliers with bit-width varying from 32 to 283 bits; the results are shown in Table II. These experiments differ from those in [6]. In our work, we first flatten the Montgomery multipliers before applying our verification technique. That is, we assume that only the positions of the primary inputs and outputs are known, without knowledge of any high-level structure. In contrast, [6] verifies Montgomery multipliers that are represented with four hierarchical blocks. For 32- to 163-bit Montgomery multipliers, our approach provides on average a 9.2×, 15.9×, 16.6×, and 17.4× speedup for $T$ = 5, 10, 20, and 30, respectively. Notice that [6] cannot verify the flattened Montgomery multipliers larger than 233 bits in 12 hours.

Analyzing Tables I and II, we observe that the rewriting technique of our approach, when applied to Montgomery multipliers, requires
significantly more time than for Mastrovito multipliers. The main reason for this difference is the internal architecture of the two multiplier types. Mastrovito multipliers are obtained directly from the standard multiplication structure, with the partial product generator followed by an XOR-tree structure, as in modular arithmetic. Since the algebraic model of XOR in GF arithmetic is linear, the size of the polynomial expressions generated during rewriting of this architecture is relatively small. In contrast, in a Montgomery multiplier the two inputs are first transformed into Montgomery form; the products of these Montgomery forms are called Montgomery products. Since the polynomial expressions in Montgomery forms are much larger than partial products, the increase in size of intermediate expressions is unavoidable.

Dependence on $P(x)$: In Table II, we observe that the CPU runtime for verifying a 163-bit multiplier is greater than that of a 233-bit multiplier. This is because the computational complexity depends not only on the bit-width of the multiplier, but also on the irreducible polynomial $P(x)$ used in constructing the multiplier.

We illustrate this fact using two GF($2^4$) multiplications implemented using two different irreducible polynomials (cf. Figure 1). We can see that for $P_1(x) = x^4 + x^3 + 1$, the longest logic paths, for $z_3$ and $z_0$, include ten and seven products that need to be generated using XORs, respectively. However, when $P_2(x) = x^4 + x + 1$, the two longest paths, for $z_1$ and $z_2$, have only seven and six products. This means that, on its longest path, the GF($2^4$) multiplication requires 9 XOR operations using $P_1(x)$ and 6 XOR operations using $P_2(x)$. In other words, the gate-level implementation of the multiplier implemented using $P_1(x)$ has more gates than the one using $P_2(x)$. In conclusion, we can see that the irreducible polynomial $P(x)$ has a significant impact on both the design cost and the verification time of GF($2^m$) multipliers.

[Figure 7: Runtime and memory usage of the parallel verification approach as a function of the number of threads $T$. Series: Mont-Runtime, Mastr-Runtime, Mont-Memory, Mastr-Memory.]

Runtime vs. Memory: In this section, we discuss the tradeoff between runtime and memory usage of our parallel approach to Galois field multiplier verification. The plots in Figure 7 show the average runtime and memory usage for different numbers of threads, over the set of multipliers shown in Tables I and II (32 to 283 bits). The vertical axis on the left is CPU runtime (in seconds), and on the right is memory usage (MB). The horizontal axis represents the number of threads $T$, ranging from 1 to 30. The runtime is significantly improved for $T$ ranging from 5 to 15. However, there is not much speedup when $T$ is greater than 20, most likely due to the memory management synchronization overhead between the threads. Similarly to the results for Mastrovito multipliers (Table I), our approach is limited here by the memory usage when the size of the multiplier and the number of threads $T$ are large. In our work, $T$ = 20 seems to be the best choice. Obviously, the best $T$ varies for different platforms, depending on the number of cores and the memory.

We also analyzed the runtime complexity of our verification algorithm for a single-thread ($T$ = 1) computation; it is shown in Figure 8. The y-axis shows the total runtime of rewriting the polynomial expressions, and the x-axis indicates the size of the Mastrovito multiplier. The result shows that the overall speedup is roughly the same for each value of $T$. Montgomery multipliers exhibit similar behavior, regardless of the choice of the irreducible polynomial.

[Figure 8: Single thread runtime analysis for Mastrovito multipliers. Axes: runtime (sec) versus size of the Mastrovito multiplier; series: T=1, T=5, T=10.]

Effect of Synthesis on Verification: In [10] the authors conclude that highly bit-optimized integer arithmetic circuits are harder to verify than their original, pre-synthesized netlists. This is because the efficiency of the rewriting technique relies on the amount of cancellations between the different terms of the polynomial, and such cancellations strongly depend on the order in which signals are rewritten. A good ordering of signals is difficult to achieve in highly bit-optimized synthesized circuits.

To see the effect of synthesis on parallel verification of GF circuits, we applied our approach to post-synthesized
Galois field multipliers with operands up to 409 bits (571-bit multipliers could not be synthesized in a reasonable time). We synthesized Mastrovito and Montgomery multipliers using the ABC tool [41]. We repeatedly used the commands resyn2 and dch until the number of AIG levels or nodes could not be reduced anymore (dch is the most efficient bit-optimization function in ABC). The synthesized multipliers were mapped using a 14nm technology library. The verification experiments shown in Table III were performed by our tool with $T$ = 20 threads. Our tool was able to verify both the 409-bit Mastrovito and Montgomery multipliers within just 13 minutes. We observed that, in our parallel approach, Galois field multipliers are easier to verify after optimization than in their original form. For example, the verification of a 283-bit Montgomery multiplier takes 15,300 seconds for $T$ = 20. After optimization, the runtime dropped to just 169.2 seconds, which means that the verification is about 90× faster than for the original implementation. The memory usage has also been reduced from 488 MB to 194 MB. In summary, in contrast to verification problems of integer multipliers [10], the bit-level optimization actually reduces the complexity of the backward rewriting process. This is because extracting the function of an output bit of a GF multiplier depends only on the logic cone of that bit and does not require logic expressions from other bits to be simplified (cf. Theorem 3). Hence, the complexity of function extraction is naturally reduced if the logic cone is minimized.
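Because each output bit is extracted from its own logic cone, the per-bit rewriting tasks are independent and can be handed to a pool of $T$ workers. The Python sketch below illustrates that task structure together with modulo-2 backward rewriting; it is not the authors' C++ implementation. The netlist `NETLIST`, its net names, and the helper names (`gate_poly`, `substitute`, `extract`, `extract_all`) are hypothetical, only AND and XOR gates are modelled, and a polynomial is represented as a set of monomials so that adding a monomial twice cancels it. In CPython the threads will not run the rewriting on several cores at once, so the pool shows how the work is split rather than the speedup.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical netlist of a 2-bit GF(2^2) multiplier, P(x) = x^2 + x + 1,
# listed in topological order as (output net, gate type, input nets).
NETLIST = [
    ("p00", "AND", ("a0", "b0")),
    ("p11", "AND", ("a1", "b1")),
    ("p01", "AND", ("a0", "b1")),
    ("p10", "AND", ("a1", "b0")),
    ("z0", "XOR", ("p00", "p11")),
    ("t", "XOR", ("p01", "p10")),
    ("z1", "XOR", ("t", "p11")),
]


def gate_poly(kind, ins):
    """GF(2) model of a gate; a polynomial is a set of monomials (frozensets of nets)."""
    if kind == "AND":
        return {frozenset(ins)}                 # a*b
    if kind == "XOR":
        return {frozenset([v]) for v in ins}    # a + b
    raise ValueError(f"unsupported gate: {kind}")


def substitute(expr, var, poly):
    """Replace var by poly in expr, keeping coefficients modulo 2."""
    out = set()
    for mono in expr:
        terms = [mono] if var not in mono else [(mono - {var}) | p for p in poly]
        for t in terms:
            out ^= {t}      # toggling membership cancels monomials that appear twice
    return out


def extract(netlist, output):
    """Backward rewriting of one output bit, from the output toward the inputs."""
    expr = {frozenset([output])}
    for name, kind, ins in reversed(netlist):
        expr = substitute(expr, name, gate_poly(kind, ins))
    return expr


def extract_all(netlist, outputs, threads):
    """Run one extraction task per output bit on a pool of `threads` workers."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = pool.map(lambda z: extract(netlist, z), outputs)
        return dict(zip(outputs, results))


if __name__ == "__main__":
    for z, expr in extract_all(NETLIST, ["z0", "z1"], threads=2).items():
        print(z, "=", " + ".join(sorted("*".join(sorted(m)) for m in expr)))
```

For this netlist the two tasks return $a_0b_0 + a_1b_1$ for `z0` and $a_0b_1 + a_1b_0 + a_1b_1$ for `z1`, the GF($2^2$) product for $P(x) = x^2 + x + 1$. If every task holds its own expression, peak memory grows roughly with the number of workers, which is the $T \cdot Mem$ effect discussed for Table I.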
TABLE III: Runtime and memory usage of synthesized Mastrovito and Montgomery multipliers ($T$ = 20). Columns: Op size, Mastrovito, Montgomery.

B. Reverse Engineering of GF($2^m$) Multipliers

The reverse engineering technique presented in this paper was implemented in C++ in the framework described in Section V. It reverse engineers bit-blasted GF($2^m$) multipliers by analyzing the algebraic expressions of each element using the approach presented in Section IV. The program was tested on a number of gate-level GF($2^m$) multipliers with different irreducible polynomials, including Montgomery multipliers and Mastrovito multipliers. The multiplier generator, taken from [1], takes the bit-width and the irreducible polynomial as inputs and generates the multipliers in the equation format. The experimental results show that our technique can successfully reverse engineer various GF($2^m$) multipliers, regardless of the GF($2^m$) algorithm and the irreducible polynomials.

We set the number of threads to 16 for all the reverse engineering evaluations in this section. This is dictated by the fact that $T$ = 16 gives the most promising performance (runtime) and scalability (memory usage) metrics on our platform, based on the analysis presented in Section VI-A2 (Figure 7).

m     P(x)                      Mastrovito-syn       Montgomery-syn
                                T(s)    Mem          T(s)    Mem
64    pentanomial, degree 64    13      25 MB        5       20 MB
163   pentanomial, degree 163   69      508 MB       221     610 MB
233   trinomial, degree 233     152     1.2 GB       154     2.9 GB
409   trinomial, degree 409     825     6.5 GB       855     10.3 GB

TABLE IV: Results of reverse engineering synthesized and technology mapped Mastrovito and Montgomery multipliers.
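The analysis step measured in these experiments (Algorithms 2 and 3 of Section V) can be sketched as follows. This is a minimal Python model under stated assumptions, not the authors' tool: all function names and the net names `n1` ... `n8` are hypothetical, the output expressions are assumed to be already ordered from LSB to MSB, and `multiplier_exprs` is only a reference model that stands in for the expressions that would be extracted from a netlist.

```python
from collections import Counter
from itertools import combinations


def multiplier_exprs(m, tail, a, b):
    """Reference model: products feeding each z_i for P(x) = x^m + sum(x^d, d in tail)."""
    exprs = [set() for _ in range(m)]
    for j in range(m):
        for k in range(m):
            exps = {j + k}
            while max(exps) >= m:               # reduce x^(j+k) using x^m = P'(x)
                top = max(exps)
                exps = (exps - {top}) ^ {top - m + d for d in tail}
            for i in exps:
                exprs[i].add(frozenset((a[j], b[k])))
    return exprs


def input_positions(in_field):
    """Algorithm 2: a variable first seen in s_i is assigned bit position i."""
    pos = {}
    for i, s_i in enumerate(in_field):
        for prod in s_i:
            for v in sorted(prod):
                pos.setdefault(v, i)
    return pos


def extract_poly(exprs):
    """Algorithm 3: exponents of P(x), given output expressions ordered LSB to MSB."""
    m = len(exprs)
    counts = Counter(p for e in exprs for p in e)
    in_field = [{p for p in e if counts[p] == 1} for e in exprs]
    pos = input_positions(in_field)
    # s'_m: every pair of distinct inputs whose bit positions sum to m
    candidates = {frozenset(p) for p in combinations(pos, 2)
                  if pos[p[0]] + pos[p[1]] == m}
    s_m = {p for p in candidates if counts[p] > 0}      # drop dummy products
    return pos, [m] + [i for i in range(m - 1, -1, -1) if s_m <= exprs[i]]


A = ["n1", "n4", "n6", "n2"]     # hypothetical, uninformative net names for a_0..a_3
B = ["n8", "n3", "n5", "n7"]     # ... and for b_0..b_3
pos, poly = extract_poly(multiplier_exprs(4, {3, 0}, A, B))
print(pos)
print(poly)                      # [4, 3, 0], i.e. x^4 + x^3 + 1
```

For $m = 4$ the candidate set holds five products, of which the two that pair inputs of the same operand never occur in any output expression and are dropped as dummies. The remaining three products are found in the expressions of bits 3 and 0 only, which yields the exponent list `[4, 3, 0]`.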
Our program takes the netlist/equations of the GF($2^m$) implementations and the number of threads as input. Hence, users can adjust the parallel effort depending on the limitations of their machines. In this work, all results are obtained with 16 threads. Typical designs that require reverse engineering are those that have been bit-optimized and mapped using a standard-cell library. Hence, we apply our technique to the bit-optimized Mastrovito and Montgomery multipliers (Table IV). For the purpose of our experiments, the multipliers are optimized and mapped using ABC [41]. Compared to the verification runtime of synthesized multipliers (Table III), the CPU time spent on analyzing the extracted expressions for reverse engineering is less than 10% of the extraction process. This is because most of the computation in the reverse engineering approach is spent on extracting the algebraic expressions, as presented in Section VI-A2, Table III.

[Figure 9: Result of reverse engineering GF($2^{233}$) Mastrovito multipliers implemented with different $P(x)$. Vertical axis: runtime (s). Polynomials: $x^{233}+x^{201}+x^{105}+x^{9}+1$, $x^{233}+x^{185}+x^{121}+x^{105}+1$, $x^{233}+x^{74}+1$, $x^{233}+x^{159}+1$.]

The reverse engineering approach has been further evaluated using four Mastrovito multipliers, each implemented with a different irreducible polynomial $P(x)$ in GF($2^{233}$). The polynomials are obtained from [43], and the multipliers are optimized using the ABC synthesis tool. The results are shown in Figure 9. We can see that the multipliers implemented with a trinomial $P(x)$ are much easier to reverse engineer than those based on a pentanomial $P(x)$. This is because the multipliers implemented with a pentanomial $P(x)$ contain more gates and have a longer critical path, since reduction over a pentanomial requires more XOR operations. The CPU runtime for irreducible polynomials of the same class (trinomials or pentanomials) is almost the same. As discussed in Section III-B, comparison of the two trinomials shows that the efficient trinomial irreducible polynomial, $x^m + x^a + 1$, typically satisfies $m - a > m/2$.

The results for designs synthesized with a 14nm technology library are shown in Figure 10. It shows that the area and delay of the Mastrovito multiplier implemented with $P(x) = x^{233} + x^{74} + 1$ are 5.7% and 7.4% less than for $P(x) = x^{233} + x^{159} + 1$, respectively.

[Figure 10: Evaluation of the design cost using GF($2^{233}$) Mastrovito multipliers with irreducible polynomials $x^{233} + x^{74} + 1$ and $x^{233} + x^{159} + 1$.]

VII. CONCLUSION
This paper presents a parallel approach to verification and reverse engineering of gate-level Galois field multipliers using a computer algebraic approach. It introduces a parallel rewriting method that can efficiently extract the functional specification of Galois field multipliers as polynomial expressions. We demonstrate that, compared to the best known algorithms, our approach tested on $T$ = 30 threads provides on average 44× and 17× speedup in verification of Mastrovito and Montgomery multipliers, respectively. We presented a novel approach that reverse engineers gate-level Galois field multipliers in which the irreducible polynomial, as well as the bit positions of the inputs and outputs, are unknown. We demonstrated that our approach can efficiently reverse engineer Galois field multipliers implemented using different irreducible polynomials. Future work will focus on formal verification of prime field arithmetic circuits and complex cryptography circuits.

ACKNOWLEDGMENT
The authors would like to thank Prof. Kalla, University of Utah, for his valuable comments and the benchmarks; and Dr. Arnaud Tisserand, University Rennes 1 ENSSAT, for his valuable discussion. This work has been funded by NSF grants CCF-1319496 and CCF-1617708.

REFERENCES
[1] J. Lv, P. Kalla, and F. Enescu, “Efficient Gröbner Basis Reductions for Formal Verification of Galois Field Arithmetic Circuits,” IEEE Trans. on CAD, vol. 32, no. 9, pp. 1409–1420, September 2013.
[2] M. Ciesielski, C. Yu, W. Brown, D. Liu, and A. Rossi, “Verification of Gate-level Arithmetic Circuits by Function Extraction,” ACM, 2015, pp. 52–57.
[3] C. Paar and J. Pelzl, Understanding Cryptography: A Textbook for Students and Practitioners. Springer Science & Business Media, 2009.
[4] M. Ciet, J.-J. Quisquater, and F. Sica, “A short note on irreducible trinomials in binary fields,” 2002.
[5] T. Pruss, P. Kalla, and F. Enescu, “Equivalence Verification of Large Galois Field Arithmetic Circuits using Word-Level Abstraction via Gröbner Bases,” in DAC’14, 2014, pp. 1–6.
[6] T. Pruss, P. Kalla, and F. Enescu, “Efficient symbolic computation for word-level abstraction from combinational circuits for verification over finite fields,” IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 35, no. 7, pp. 1206–1218, 2016.
[7] E. Pavlenko, M. Wedler, D. Stoffel, W. Kunz, A. Dreyer, F. Seelisch, and G. Greuel, “STABLE: A new QF-BV SMT solver for hard verification problems combining Boolean reasoning with computer algebra,” in DATE, 2011, pp. 155–160.
[8] F. Farahmandi and B. Alizadeh, “Groebner basis based formal verification of large arithmetic circuits using Gaussian elimination and cone-based polynomial extraction,” Microprocessors and Microsystems, vol. 39, no. 2, pp. 83–96, 2015.
[9] A. Sayed-Ahmed, D. Große, U. Kühne, M. Soeken, and R. Drechsler, “Formal verification of integer multipliers by combining Gröbner basis with logic reduction,” in DATE’16, 2016, pp. 1–6.
[10] C. Yu, W. Brown, D. Liu, A. Rossi, and M. J. Ciesielski, “Formal verification of arithmetic circuits using function extraction,” IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 35, no. 12, pp. 2131–2142, 2016.
[11] C. Yu and M. J. Ciesielski, “Automatic word-level abstraction of datapath,” in IEEE International Symposium on Circuits and Systems, ISCAS 2016, Montréal, QC, Canada, 2016, pp. 1718–1721.
[12] A. Sayed-Ahmed, D. Große, M. Soeken, and R. Drechsler, “Equivalence checking using Gröbner bases,” FMCAD’2016, 2016.
[13] C. Yu and M. J. Ciesielski, “Efficient parallel verification of Galois field multipliers,” ASP-DAC’17, 2017.
[14] R. E. Bryant, “Graph-based algorithms for Boolean function manipulation,” IEEE Trans. on Computers, vol. C-35, no. 8, pp. 677–691, 1986.
[15] R. E. Bryant and Y. Chen, “Verification of arithmetic circuits with binary moment diagrams,” in Proceedings of the 32nd Conference on Design Automation, San Francisco, California, USA, Moscone Center, June 12-16, 1995, 1995, pp. 535–541.
[16] Y.-A. Chen and R. Bryant, “*PHDD: An Efficient Graph Representation for Floating Point Circuit Verification,” School of Computer Science, Carnegie Mellon University, Tech. Rep. CMU-CS-97-134, 1997.
[17] M. Ciesielski, P. Kalla, and S. Askar, “Taylor Expansion Diagrams: A Canonical Representation for Verification of Data Flow Designs,”
IEEETrans. on Computers , vol. 55, no. 9, pp. 1188–1201, Sept. 2006.[18] R. Kaivola, R. Ghughal, N. Narasimhan, A. Telfer, J. Whittemore,S. Pandav, A. Slobodov´a, C. Taylor, E. R. V. Frolov, and A. Naik., “Re-placing Testing with Formal Verification in Intel CoreTM i7 ProcessorExecution Engine Validation,” in
Computer Aided Verification (CAV) .Springer, 2009, pp. 414–429.[19] A. Mishchenko et al. , “ABC: A System for Sequential Synthesis andVerification,” , 2007.[20] N. Sorensson and N. Een, “Minisat v1. 13-a sat solver with conflict-clause minimization,”
SAT , vol. 2005, p. 53, 2005.[21] M. Soos, “Enhanced Gaussian Elimination in DPLL-based SATSolvers.” in
POS@ SAT , 2010, pp. 2–14.[22] M. Davis, G. Logemann, and D. Loveland, “A machine program fortheorem-proving,”
Communications of the ACM , vol. 5, no. 7, pp. 394–397, 1962.[23] C.-Y. Huang and K.-T. Cheng, “Using Word-level ATPG and ModularArithmetic Constraint-Solving Techniques for Assertion Property Check-ing,”
IEEE Trans. on CAD , vol. 20, no. 3, pp. 381–391, March 2001.[24] F. Fallah, S. Devadas, and K. Keutzer, “Functional vector generationfor hdl models using linear programming and 3-satisfiability,” in
DesignAutomation Conference (DAC) . IEEE, 1998, pp. 528–533.[25] R. Brinkmann and R. Drechsler, “RTL-datapath Verification using Inte-ger Linear Programming,” in
Proceedings of the 2002 Asia and SouthPacific Design Automation Conference (ASP-DAC) . IEEE ComputerSociety, 2002, p. 741.[26] Z. Zeng, K. R. Talupuru, and M. Ciesielski, “Functional Test GenerationBased on Word-level SAT,”
Journal of Systems Architecture , vol. 51,no. 8, pp. 488–511, 2005.[27] A. Biere, M. Heule, and H. van Maaren,
Handbook of satisfiability . iospress, 2009, vol. 185.[28] A. Niemetz, M. Preiner, and A. Biere, “Boolector 2.0,”
Journal onSatisfiability, Boolean Modeling and Computation , vol. 9, 2015.[29] L. De Moura and N. Bjørner, “Z3: An efficient smt solver,” in
Tools andAlgorithms for the Construction and Analysis of Systems . Springer,2008, pp. 337–340.[30] C. Barrett, C. L. Conway, M. Deters, L. Hadarean, D. Jovanovi´c, T. King,A. Reynolds, and C. Tinelli, “CVC4,” in
Computer aided verification(CAV) . Springer, 2011, pp. 171–177.[31] M. J. Gordon and T. F. Melham, “Introduction to HOL A TheoremProving Environment for Higher Order Logic,” in
Cambridge UniversityPress , 1993.[32] S. Owre, J. M. Rushby, and N. Shankar, “PVS: A Prototype VerificationSystem,” in
Automated Deduction - CADE-11 . Springer, 1992, pp. 748–752.[33] B. Brock, M. Kaufmann, and J. S. Moore, “Acl2 theorems aboutcommercial microprocessors,” in
Formal Methods in Computer-AidedDesign (FMCAD) . Springer, 1996, pp. 275–293.[34] S. Vasudevan, V. Viswanath, R. W. Sumners, and J. A. Abraham,“Automatic Verification of Arithmetic Circuits in RTL using StepwiseRefinement of Term Rewriting Systems,”
IEEE Trans. on Computers ,vol. 56, no. 10, pp. 1401–1414, 2007.[35] D. Kapur and M. Subramaniam, “Mechanical Verification of AdderCircuits using Rewrite Rule Laboratory,”
Formal Methods in SystemDesign (FMCAD) , vol. 13, no. 2, pp. 127–158, 1998.[36] D. Stoffel and W. Kunz, “Equivalence Checking of Arithmetic Circuitson the Arithmetic Bit Level,”
IEEE Trans. on CAD , vol. 23, no. 5, pp.586–597, May 2004.[37] NIST, “Recommended elliptic curves for federal government use,” 1999.[38] C. Yu, D. Holcomb, and M. Ciesielski, “Reverse engineering of irre-ducible polynomials in gf (2 m) arithmetic,” in . IEEE, 2017, pp.1558–1563.[39] C. K. Koc and T. Acar, “Montgomery multiplication in gf (2k),”
Designs,Codes and Cryptography , vol. 14, no. 1, pp. 57–69, 1998.[40] B. Sunar and C¸ . K. Koc¸, “Mastrovito multiplier for all trinomials,”
Computers, IEEE Transactions on , vol. 48, no. 5, pp. 522–527, 1999.[41] A. Mishchenko et al. , “Abc: A system for sequential synthesis andverification,” , 2007.[42] W. Decker, G.-M. Greuel, G. Pfister, and H. Sch¨onemann, “S
INGULAR
EEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS 13 [43] M. Scott, “Optimal irreducible polynomials for gf (2m) arithmetic.”