Formal Analysis of Galois Field Arithmetics - Parallel Verification and Reverse Engineering

Abstract

Galois field (GF) arithmetic circuits find numerous applications in communications, signal processing, and security engineering. Formal verification techniques for GF circuits are scarce and limited to circuits with known bit positions of the primary inputs and outputs. They also require knowledge of the irreducible polynomial $P(x)$, which affects the final hardware implementation. This paper presents a computer algebra technique that performs verification and reverse engineering of GF($2^m$) multipliers directly from the gate-level implementation. The approach is based on extracting a unique irreducible polynomial in a parallel fashion and proceeds in three steps: 1) determine the bit position of the output bits; 2) determine the bit position of the input bits; and 3) extract the irreducible polynomial used in the design. We demonstrate that this method is able to reverse engineer GF($2^m$) multipliers in $m$ threads. Experiments performed on synthesized Mastrovito and Montgomery multipliers with different $P(x)$, including NIST-recommended polynomials, demonstrate high efficiency of the proposed method.


IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS

Formal Analysis of Galois Field Arithmetic Circuits - Parallel Verification and Reverse Engineering

Cunxi Yu, Student Member, IEEE, and Maciej Ciesielski, Senior Member, IEEE

Index Terms: Galois field arithmetic, computer algebra, formal verification, reverse engineering, parallelism.

C. Yu and M. Ciesielski are with the Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA, 01375. The related tools and benchmarks are released publicly on Github, ycunxi.github.io/Parallel Formal Analysis GaloisField

I. INTRODUCTION

Despite considerable progress in verification of random and control logic, advances in formal verification of arithmetic circuits have been lagging. This can be attributed to the difficulty in efficient modeling of arithmetic circuits and datapaths without resorting to computationally expensive Boolean methods. Contemporary formal techniques, such as Binary Decision Diagrams (BDDs), Boolean Satisfiability (SAT), Satisfiability Modulo Theories (SMT), etc., are not directly applicable to verification of integer and finite field arithmetic circuits [1] [2]. This paper concentrates on formal verification and reverse engineering of finite (Galois) field arithmetic circuits.

Galois field (GF) is a number system with a finite number of elements and two main arithmetic operations, addition and multiplication; other operations can be derived from those two [3]. GF arithmetic plays an important role in coding theory, cryptography, and their numerous applications. Therefore, developing formal techniques for hardware implementations of GF arithmetic circuits, and particularly for finite field multiplication, is essential.

The elements in field GF($2^m$) can be represented using polynomial rings. The field of size $2^m$ is constructed using an irreducible polynomial $P(x)$, which includes terms of degree $d$ with $d \in [0, m]$ and coefficients in GF(2). The arithmetic operation in the field is then performed modulo $P(x)$. The choice of the irreducible polynomial has a significant impact on the hardware implementation of the GF circuit and its performance. Typically, the irreducible polynomial with a minimum number of elements gives the best performance [4], but this is not always the case.

Due to the rising number of threats in hardware security, analyzing finite field circuits becomes important. Computer algebra techniques with polynomial representation seem to offer the best solution for analyzing arithmetic circuits. Several works address the verification and functional abstraction problems, both in Galois field arithmetic [1] [5] [6] and integer arithmetic implementations [7] [2] [8] [9] [10]. Symbolic computer algebra methods have also been used to reverse engineer the word-level operations for GF circuits and integer arithmetic circuits to improve verification performance [11] [12] [5]. The verification problem is typically formulated as proving that the implementation satisfies the specification, and is accomplished by performing a series of divisions of the specification polynomial by the implementation polynomials. Yu et al. [11] proposed an original spectral method based on analyzing the internal algebraic expressions during the rewriting procedure. Sayed-Ahmed et al. [12] introduced a reverse engineering technique in the Algebraic Combinational Equivalence Checking (ACEC) process by converting the function into canonical polynomials and using a Gröbner basis.

However, the above-mentioned algebraic techniques have several limitations. Firstly, they are restricted to implementations with a known binary encoding of the inputs and outputs.
This information is needed to generate the specification polynomial that describes the circuit functionality in terms of its inputs and outputs, which is necessary for the polynomial reduction process (described in Section II-D). Secondly, these methods are unable to exploit parallelism (inherent in GF circuits), as they require that the polynomial division be applied iteratively in reverse-topological order [2] [9] [6]. Thirdly, the approaches applied specifically to GF($2^m$) arithmetic circuits [5] [6] require knowledge of the irreducible polynomial $P(x)$ of the circuit.

In this work, we present a formal approach to reverse engineer gate-level finite field arithmetic circuits that exploits the inherent parallelism of GF circuits. The method is based on a parallel algebraic rewriting approach [13] and is applied specifically to multipliers. The objective of reverse engineering is as follows: given the netlist of a gate-level GF multiplier, extract the bit positions of input and output bits and the irreducible polynomial used in constructing the GF multiplication; then extract the specification of the design using this information. Bit position $i$ indicates the location of the bit in the binary word according to its significance (LSB vs MSB). Our approach solves this problem by transforming the algebraic expressions of the output bits into an algebraic expression of the input bits (specification), and this is done in parallel for each output bit. Specifically, it includes the following steps:

• Extract the algebraic expression of each output bit.
• Determine the bit position of the outputs.
• Determine the bit position of the inputs.
• Extract the irreducible polynomial $P(x)$.
• Extract the specification by algebraic rewriting.

We demonstrate the efficiency of our method using GF($2^m$) Mastrovito and Montgomery multipliers of up to 571-bit width in a bit-blasted format (i.e., flattened to bit-level), implemented using various irreducible polynomials.

II. BACKGROUND

A. Canonical Diagrams

Several approaches have been proposed to check an arithmetic circuit against its functional specification. Different variants of canonical, graph-based representations have been proposed, including Binary Decision Diagrams (BDDs) [14], Binary Moment Diagrams (BMDs) [15] [16], Taylor Expansion Diagrams (TED) [17], and other hybrid diagrams. While BDDs have been used extensively in logic synthesis, their application to verification of arithmetic circuits is limited by the prohibitively high memory requirement for complex arithmetic circuits, such as multipliers. BDDs are being used, along with many other methods, for local reasoning, but not as a monolithic data structure [18]. BMDs and TEDs offer a better space complexity but require word-level information of the design, which is often not available or is hard to extract from bit-level netlists. While the canonical diagrams have been used extensively in logic synthesis, high-level synthesis, and verification, their application to verifying large arithmetic circuits remains limited by the prohibitively high memory requirement of complex arithmetic circuits [2] [1].

B. SAT, ILP and SMT Solvers

Arithmetic verification problems have been typically modeled using Boolean satisfiability (SAT). Several SAT solvers have been developed to solve Boolean decision problems, including ABC [19], MiniSAT [20], and others. Some of them, such as CryptoMiniSAT [21], specifically target XOR-rich circuits and are potentially useful for arithmetic circuit verification, but all are based on a computationally expensive DPLL (Davis, Putnam, Logemann, Loveland) decision procedure [22]. Some techniques combine automatic test pattern generation (ATPG) and modular arithmetic constraint-solving techniques for the purpose of test generation and assertion checking [23]. Others integrate linear arithmetic constraints with Boolean SAT in a unified algebraic domain [24], but their effectiveness is limited by constraint propagation across the Boolean and word-level boundary. To avoid this problem, methods based on ILP models of arithmetic operators have been proposed [25] [26], but in general ILP techniques are known to be computationally expensive and not scalable to large-scale systems.

SMT solvers depart from treating the problem in a strictly Boolean domain and integrate different well-defined theories (Boolean logic, bit vectors, integer arithmetic, etc.) into a DPLL-style SAT decision procedure [27]. Some of the most effective SMT solvers, potentially applicable to our problem, are Boolector [28], Z3 [29], and CVC [30]. However, SMT solvers still model functional verification as a decision problem and, as demonstrated by extensive experimental results, neither SAT nor SMT solvers can efficiently solve the verification problem of large arithmetic circuits [1] [10].

Footnote: Our tool and benchmarks used in this journal paper are released publicly at our project website at https://ycunxi.github.io/Parallel Formal Analysis GaloisField

C. Theorem Provers

Another class of solvers includes theorem provers: deductive systems for proving that an implementation satisfies the specification, using mathematical reasoning. The proof system is based on a large and strongly problem-specific database of axioms and inference rules, such as simplification, term rewriting, induction, etc. Some of the most popular theorem proving systems are HOL [31], PVS [32], ACL2 [33], and the term rewriting method described in [34]. These systems are characterized by high abstraction and powerful logic expressiveness, but they are highly interactive and require intimate domain knowledge, extensive user guidance, and expertise for efficient use. The success of verification using theorem provers depends on the set of available axioms and rewrite rules, and on the choice and order in which the rules are applied during the proof process, with no guarantee of a conclusive answer [35].

D. Computer Algebra Approaches

The most advanced techniques that have potential to solve the arithmetic verification problems are those based on Symbolic Computer Algebra. The verification problem is typically formulated as a proof that the implementation satisfies the specification [2] [1] [8] [7] [9]. This is accomplished by performing a series of divisions of the specification polynomial $F$ by a set of implementation polynomials $B$, representing circuit components, a process referred to as reduction of $F$ modulo $B$. Polynomials $f_1, ..., f_s \in B$ are called the bases, or generators, of the ideal $J$. Given a set $f_1, ..., f_s$ of generators of $J$, the set of all simultaneous solutions to the system of equations $f_1 = 0; ...; f_s = 0$ is called a variety $V(J)$. The verification problem is then formulated as a test of whether the specification $F$ vanishes on $V(J)$, i.e., if $F \in V(J)$. This is known in computer algebra as the ideal membership testing problem [1].

The standard procedure to test if $F \in V(J)$ is to divide polynomial $F$ by the polynomials $\{f_1, ..., f_s\}$ of $B$, one by one. The goal is to cancel, at each iteration, the leading term of $F$ using one of the leading terms of $f_1, ..., f_s$. If the remainder $r$ of the division is 0, then $F$ vanishes on $V(J)$, proving that the implementation satisfies the specification. However, if $r \neq 0$, such a conclusion cannot be made; $B$ may not be sufficient to reduce $F$ to 0, and yet the circuit may be correct. To reliably check if $F$ is reducible to zero, a canonical set of generators, $G = \{g_1, ..., g_t\}$, called a Gröbner basis, is needed. It has been shown that for combinational circuits with no feedback, certain conditions automatically make the set $B$ a Gröbner basis [36]. Specifically, if the polynomials $f_1, ..., f_s \in B$ are ordered in reverse topological order of logic gates, from primary outputs to primary inputs, and the leading term of each polynomial is the output of a logic gate, then the set $B$ is automatically a Gröbner basis. Some authors use Gaussian elimination, rather than explicit polynomial division, to speed up the polynomial reduction process [1] [8]. The polynomials corresponding to fanout-free logic cones can be precomputed to reduce the size of the problem [8].

The polynomial reduction technique has been successfully applied to both integer arithmetic circuits [9] and Galois field arithmetic [1]. Verification work on Galois field arithmetic has been presented in [1] [5]. Formulation of problems in GF arithmetic takes advantage of known properties of Galois fields during polynomial reductions. Specifically, the problem reduces to ideal membership testing over a larger ideal that includes the ideal $J_0 = \langle x^2 - x \rangle$ in $\mathbb{F}_2$, for each internal signal $x$ of the circuit. Inclusion of this ideal ensures that each signal assumes a binary value. In this paper, we provide a comparison between this technique and our approach.

E. Function Extraction

Function extraction is an arithmetic verification method originally proposed in [2] for arithmetic circuits in modular integer arithmetic $\mathbb{Z}_{2^m}$. It extracts a unique bit-level polynomial function implemented by the circuit directly from its gate-level implementation. Instead of expensive polynomial division, extraction is done by backward rewriting, i.e., transforming the polynomial representing the encoding of the primary outputs (called the output signature) into a polynomial at the primary inputs (the input signature) using algebraic models of the logic gates of the circuit. That is, the rewriting is performed in reverse topological order. This technique has been successfully applied to large integer arithmetic circuits, such as 512-bit integer multipliers. However, it is not directly applicable to large Galois field multipliers because of a potentially exponential number of polynomial terms that appear before the internal term cancellations take place during rewriting. Fortunately, arithmetic GF($2^m$) circuits offer an inherent parallelism which can be exploited in backward rewriting, without memory explosion.

In the rest of the paper, we first describe how to apply such parallel rewriting in GF($2^m$) circuits while avoiding the memory explosion experienced in integer arithmetic circuits. Using this approach, we extract the function of each output bit in $\mathbb{F}_{2^m}$, and the function is represented as a pseudo-Boolean polynomial expression, where all variables are Boolean. Finally, we propose a method to reverse engineer the GF($2^m$) designs by analyzing these expressions.

III. GALOIS FIELD MULTIPLICATION

Galois field (GF) is a number system with a finite number of elements and two main arithmetic operations, addition and multiplication; other operations such as division can be derived from those two [3]. A Galois field with $p$ elements is denoted as GF($p$). The most widely used finite fields are Prime Fields and Extension Fields, and particularly Binary Extension Fields. A prime field, denoted GF($p$), is a finite field consisting of a finite number of integers $\{0, 1, ..., p-1\}$, where $p$ is a prime number, with addition and multiplication performed modulo $p$. A binary extension field, denoted GF($2^m$) (or $\mathbb{F}_{2^m}$), is a finite field with $2^m$ elements. Unlike in prime fields, however, the operations in extension fields are not computed modulo $2^m$. Instead, in one possible representation (called polynomial basis), each element of GF($2^m$) is a polynomial with $m$ terms (degree less than $m$) and coefficients in GF(2), taken modulo $P(x)$. Addition of field elements is the usual addition of polynomials, with coefficient arithmetic performed modulo 2. Multiplication of field elements is performed modulo an irreducible polynomial $P(x)$ of degree $m$ with coefficients in GF(2). The irreducible polynomial $P(x)$ is analogous to the prime number $p$ in prime fields GF($p$). In this work, we focus on the verification problem of GF($2^m$) multipliers, which appear in many cryptography and some DSP applications.

A. GF Multiplication Principle
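As a concrete reference for the polynomial-basis arithmetic described above, the following is a minimal Python sketch, not the authors' tool. It assumes field elements are packed into integers (bit $i$ holds the coefficient of $x^i$); the function name `gf2m_mul` and the constants `P1` and `P2` are illustrative choices, with the two degree-4 polynomials taken from the GF($2^4$) example in this section.

```python
def gf2m_mul(a: int, b: int, p: int, m: int) -> int:
    """Multiply a and b in GF(2^m); bit i of an int is the coefficient of x^i.

    p encodes the irreducible polynomial P(x), including its x^m term.
    """
    # Carry-less schoolbook product: XOR replaces addition, so no carries
    # propagate between coefficient columns s_0 .. s_(2m-2).
    s = 0
    for i in range(m):
        if (b >> i) & 1:
            s ^= a << i
    # Reduce modulo P(x): cancel every coefficient of degree >= m,
    # highest degree first, by adding a shifted copy of P(x).
    for k in range(2 * m - 2, m - 1, -1):
        if (s >> k) & 1:
            s ^= p << (k - m)
    return s


P1 = 0b11001  # x^4 + x^3 + 1
P2 = 0b10011  # x^4 + x + 1

# x^3 * x = x^4, which reduces differently under the two polynomials.
print(bin(gf2m_mul(0b1000, 0b0010, P1, 4)))  # x^3 + 1
print(bin(gf2m_mul(0b1000, 0b0010, P2, 4)))  # x + 1
```

Addition in this encoding is simply `a ^ b`. The same operands yield different products under `P1` and `P2`, which is why the irreducible polynomial must be known (or recovered) before a multiplier can be checked against a specification.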

Two different GF multiplication structures, constructed using different irreducible polynomials $P_1(x)$ and $P_2(x)$, are shown in Figure 1. Integer multiplication takes two $n$-bit operands as input and generates a $2n$-bit word, where the values computed at lower significant bits ripple through the carry chain all the way to the most significant bit (MSB). In contrast, in GF($2^m$) implementations the number of outputs is reduced to $n$ using the irreducible polynomial $P(x)$. The product terms are added for each column (output bit position) modulo 2, hence there is no carry propagation. For example, to represent the result in GF($2^4$), with only four output bits, the four most significant bits in the result of the integer multiplication have to be reduced to GF($2^4$). The result of such a reduction is shown in Figure 1. In GF($2^4$), the input and output operands are represented using polynomials $A(x)$, $B(x)$ and $Z(x)$, where $A(x) = \sum_{n=0}^{3} a_n x^n$, $B(x) = \sum_{n=0}^{3} b_n x^n$, and $Z(x) = \sum_{n=0}^{3} z_n x^n$, respectively.

Example 1: The function of each multiplication bit $s_i$ ($i \in [0, 6]$) is represented using polynomials in GF(2), namely: $s_0 = a_0 b_0$, $s_1 = a_1 b_0 + a_0 b_1$, etc., up to $s_6 = a_3 b_3$. The output bits $z_n$ ($n \in [0, 3]$) are computed modulo the irreducible polynomial $P(x)$. Using $P_2(x) = x^4 + x + 1$, we obtain: $z_0 = s_0 + s_4$, $z_1 = s_1 + s_4 + s_5$, $z_2 = a_0 b_2 + a_1 b_1 + a_2 b_0 + a_2 b_3 + a_3 b_2 + a_3 b_3$, and $z_3 = a_0 b_3 + a_1 b_2 + a_2 b_1 + a_3 b_0 + a_3 b_3$. The coefficients of the multiplication results are shown in Figure 2. In digital circuits, partial products are implemented using AND gates, and addition modulo 2 is done using XOR gates. Note that, unlike in integer multiplication, in GF($2^m$) circuits there is no carry out to the next bit. For this reason, as we can see in Figure 1, the function of each output bit can be computed independently of other bits.

Footnote: For polynomials in GF(2), "+" is computed modulo 2.

[Figure 1 content, columns listed from $z_3$ to $z_0$. For $P_1(x) = x^4 + x^3 + 1$: $z_3 = s_3 + s_4 + s_5 + s_6$, $z_2 = s_2 + s_6$, $z_1 = s_1 + s_5 + s_6$, $z_0 = s_0 + s_4 + s_5 + s_6$. For $P_2(x) = x^4 + x + 1$: $z_3 = s_3 + s_6$, $z_2 = s_2 + s_5 + s_6$, $z_1 = s_1 + s_4 + s_5$, $z_0 = s_0 + s_4$.]

Figure 1: Two GF($2^4$) multiplications constructed using $P_1(x) = x^4 + x^3 + 1$ and $P_2(x) = x^4 + x + 1$.

Output | Polynomial expression
$z_0$ | $(a_0 b_0) + a_1 b_3 + a_2 b_2 + a_3 b_1$
$z_1$ | $(a_0 b_1 + a_1 b_0) + a_1 b_3 + a_2 b_2 + a_3 b_1 + a_2 b_3 + a_3 b_2$
$z_2$ | $(a_0 b_2 + a_1 b_1 + a_2 b_0) + a_2 b_3 + a_3 b_2 + a_3 b_3$
$z_3$ | $(a_0 b_3 + a_1 b_2 + a_2 b_1 + a_3 b_0) + a_3 b_3$

Figure 2: Extracted algebraic expressions of the four output bits of the GF($2^4$) multiplier for $P_2(x) = x^4 + x + 1$.

B. Irreducible Polynomials
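To make the dependence of the reduction on $P(x)$ concrete, the sketch below computes, for a given set of exponents of $P(x)$, which partial-product sums $s_k$ land in each output column $z_j$, counts the two-input XORs introduced by the reduction (terms per column minus one), and reads the exponents of $P(x)$ back from the columns that contain $s_m$. The function names are hypothetical and the code is only an illustration of the bookkeeping discussed in this section, not the authors' implementation.

```python
def reduction_columns(p_exps, m):
    """Return {j: [k, ...]}: the sums s_k that are XORed into output column z_j.

    p_exps holds the exponents of the non-zero terms of P(x), e.g. {4, 1, 0}.
    """
    low = sorted(e for e in p_exps if e < m)  # x^m == sum of x^e over these e
    # rep[k] = set of exponents j < m such that x^k == sum_j x^j (mod P(x))
    rep = {k: {k} for k in range(m)}
    for k in range(m, 2 * m - 1):
        acc = set()
        for e in low:
            # x^k = x^(k-m) * x^m, so term x^e of x^m contributes x^(k-m+e);
            # symmetric difference implements addition modulo 2.
            acc ^= rep[k - m + e]
        rep[k] = acc
    return {j: [k for k in range(2 * m - 1) if j in rep[k]] for j in range(m)}


def xor_count(cols):
    """Two-input XORs per column = number of terms in the column minus one."""
    return sum(len(terms) - 1 for terms in cols.values())


def recover_p(cols, m):
    """P(x) contains x^m plus x^j for every column z_j that receives s_m."""
    return {m} | {j for j, terms in cols.items() if m in terms}


for name, p in (("P1", {4, 3, 0}), ("P2", {4, 1, 0})):
    cols = reduction_columns(p, 4)
    print(name, cols, xor_count(cols), sorted(recover_p(cols, 4), reverse=True))
```

For the two degree-4 polynomials of Figure 1 this gives 9 and 6 XORs respectively, and `recover_p` returns the original exponent sets, since $s_4$ appears exactly in the columns named by the low-order terms of $P(x)$.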

In general, there are various irreducible polynomials that can be used for a given field size, each resulting in a different multiplication result. For constructing efficient arithmetic functions over GF($2^m$), the irreducible polynomial is typically chosen to be a trinomial, $x^m + x^a + 1$, or a pentanomial, $x^m + x^a + x^b + x^c + 1$ [37]. For efficiency reasons, the exponents $m$ and $a$ are chosen such that $m - a \geq m/2$.

An example of constructing GF($2^4$) multiplication using two different irreducible polynomials is shown in Figure 1. We can see that each polynomial produces a unique multiplication result. The size of the corresponding multiplier can be estimated by counting the number of XOR operations in each multiplication. Since the number of AND and XOR operations for generating partial products (variables $s_i$ in Figure 1) is the same, the difference is only caused by the reduction of the corresponding polynomials modulo $P(x)$. The number of two-input XOR operations introduced by the reduction with $P(x)$ can be obtained as the number of terms in each column minus one. For example, the number of XORs using $P_1(x)$ is 3+1+2+3=9; and using $P_2(x)$, the number of XORs is 1+2+2+1=6.

As will be shown in the next section, given the structure of the GF($2^m$) multiplication, such as the one shown in Figure 1, one can readily identify the irreducible polynomial $P(x)$ used during the GF reduction. This can be done by extracting the terms $s_k$ corresponding to the entry $s_m$ (here $s_4$) in the table and generating the irreducible polynomial beyond $x^m$. We know that $P(x)$ must contain $x^m$, and the remaining terms $x^k$ of $P(x)$ are obtained from the non-zero terms corresponding to the entry $s_m$. For example, for the irreducible polynomial $P_1(x) = x^4 + x^3 + x^0$, the terms $x^3$ and $x^0$ are obtained by noticing the placement of $s_4$ in columns $z_3$ and $z_0$. Similarly, for $P_2(x) = x^4 + x^1 + x^0$, the terms $x^1$ and $x^0$ are obtained by noticing that $s_4$ is placed in columns $z_1$ and $z_0$. The reason for this and the details of the procedure will be explained in the next section.

IV. PARALLEL EXTRACTION IN GALOIS FIELD

In this section, we introduce our method for extracting the unique algebraic expressions of the output bits (e.g., Figure 2) using a computer algebraic method. This can be used to verify the GF($2^m$) multipliers when the binary encoding of inputs and outputs and the irreducible polynomial are given. We introduce a parallel function extraction framework in GF($2^m$), which allows us to individually extract the algebraic expression of each output bit. This framework is used for reverse engineering, since our reverse engineering approach is based on analyzing the algebraic expression of output bits in GF(2), as introduced in Section I.

A. Computer Algebraic model

The circuit is modeled as a network of logic elements of arbitrary complexity, including basic logic gates (AND, OR, XOR, INV) and complex standard cell gates (AOI, OAI, etc.) generated by logic synthesis and technology mapping. We extend the algebraic model of Boolean operators developed in [10] for integer arithmetic to finite field arithmetic in GF(2), i.e., modulo 2. For example, the pseudo-Boolean model of XOR, $XOR(a, b) = a + b - 2ab$, is reduced to $(a + b + 2ab) \bmod 2 = (a + b) \bmod 2$. The following algebraic equations are used to describe basic logic gates in GF($2^m$) [1]:

$$
\begin{aligned}
\neg a &= 1 + a \\
a \wedge b &= a \cdot b \\
a \vee b &= a + b + a \cdot b \\
a \oplus b &= a + b
\end{aligned}
\tag{1}
$$

B. Outline of the Approach
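The gate models of Eq. (1), on which the rewriting below relies, can be checked exhaustively over the four Boolean input combinations. The following sketch does so in Python; the table name `MODELS` and the helper `check_models` are illustrative, and each algebraic model is evaluated with coefficient arithmetic modulo 2.

```python
from itertools import product

# Each entry pairs the algebraic model of Eq. (1), evaluated modulo 2,
# with the reference Boolean function (INV ignores its second argument).
MODELS = {
    "INV": (lambda a, b: (1 + a) % 2,         lambda a, b: 1 - a),
    "AND": (lambda a, b: (a * b) % 2,         lambda a, b: a & b),
    "OR":  (lambda a, b: (a + b + a * b) % 2, lambda a, b: a | b),
    "XOR": (lambda a, b: (a + b) % 2,         lambda a, b: a ^ b),
}


def check_models():
    """Return {gate: True/False} after comparing model and gate on all inputs."""
    return {
        name: all(alg(a, b) == ref(a, b) for a, b in product((0, 1), repeat=2))
        for name, (alg, ref) in MODELS.items()
    }


def xor_integer_model_agrees():
    """The integer model a + b - 2ab and the GF(2) model a + b agree modulo 2."""
    return all((a + b - 2 * a * b) % 2 == (a + b) % 2
               for a, b in product((0, 1), repeat=2))


print(check_models())
print(xor_integer_model_agrees())
```

The second check mirrors the remark in the text: the $2ab$ term of the pseudo-Boolean XOR model has an even coefficient, so it vanishes once coefficients are taken in GF(2).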

Similarly to the work of [2] and [10], the arithmetic function computed by the circuit is obtained by transforming (rewriting) the polynomial representing the encoding of the primary outputs (called the output signature) into the polynomial at the primary inputs, the input signature. The output signature of a GF($2^m$) multiplier is $Sig_{out} = \sum_{i=0}^{m-1} z_i x^i$, with $z_i \in$ GF(2). The input signature of a GF($2^m$) multiplier is $Sig_{in} = \sum_{i=0}^{m-1} P_i x^i$, with coefficients $P_i \in$ GF(2) being product terms, and the addition operation performed modulo 2. If the irreducible polynomial $P(x)$ is provided, $Sig_{in}$ is known; otherwise, it will be computed by backward rewriting from $Sig_{out}$. The goal is to transform the output signature, $Sig_{out}$, using the polynomial representation of the internal logic elements (1), into an input signature $Sig_{in}$ in GF($2^m$), which determines the arithmetic function (specification) computed by the circuit.

Theorem 1: Given a combinational arithmetic circuit in GF($2^m$), composed of logic gates described by Eq. (1), the input signature $Sig_{in}$ computed by backward rewriting is unique and correctly represents the function implemented by the circuit in GF($2^m$).

Proof: The proof of correctness relies on the fact that each transformation step (rewriting iteration) is correct. That is, each internal signal is represented by an algebraic expression, which always evaluates to a correct value in GF($2^m$). This is guaranteed by the correctness of the algebraic model in Eq. (1), which can be proved easily by inspection. For example, the algebraic expression of XOR(a,b) in $\mathbb{Z}_{2^m}$ is $a + b - 2ab$. When implemented in GF($2^m$), the coefficients in the expression must be in GF(2), hence XOR(a,b) in GF($2^m$) is represented by $a + b$. The proof of uniqueness is done by induction on $i$, the step of transforming polynomial $F_i$ into $F_{i+1}$. A detailed induction proof of this theorem is provided in [2] for expressions in $\mathbb{Z}_{2^m}$. □

Algorithm 1 Backward Rewriting in GF($2^m$)
Input: Gate-level netlist of GF($2^m$) multiplier
Input: Output signature $Sig_{out}$, and (optionally) input signature, $Sig_{in}$
Output: GF function of the design; return $Sig_{out} == Sig_{in}$
 1: $P = \{p_0, p_1, ..., p_n\}$: polynomials representing gate-level netlist
 2: $F_0 = Sig_{out}$
 3: for each polynomial $p_i \in P$ do
 4:     for output variable $v$ of $p_i$ in $F_i$ do
 5:         replace every variable $v$ in $F_i$ by algebraic expression of $p_i$
 6:         $F_i \rightarrow F_{i+1}$
 7:         for each monomial $M$ in $F_{i+1}$ do
 8:             if the coefficient of $M$ %2==0 or $M$ is a constant, $M$ %2==0 then
 9:                 remove $M$ from $F_{i+1}$
10:             end if
11:         end for
12:     end for
13: end for
14: return $F_n$ and $F_n =? Sig_{in}$

Theorem 1, together with the algebraic model of Boolean gates (1), provides the basis for polynomial reduction using backward rewriting. This is described by Algorithm 1. The method takes the gate-level netlist of a GF($2^m$) multiplier as input and first converts each logic gate into an algebraic expression using Eq. (1). The rewriting process starts with the output signature $F_0 = Sig_{out}$ and performs rewriting in reverse topological order, from outputs to inputs. It ends when all the variables in $F_i$ are primary inputs, at which point it becomes the input signature $Sig_{in}$ [2].

Each iteration includes two basic steps: 1) substitute the variable of the gate output using the expression in the inputs of the gate (Eq. 1), and name the new expression $F_{i+1}$ (lines 3 - 6); and 2) simplify the new expression in two ways: a) by eliminating terms that cancel each other (as in the integer arithmetic case [2]), and b) by removing all the monomials (including constants) that reduce to 0 in GF(2) (line 3 and lines 7 - 10). The algorithm outputs the arithmetic function of the design in GF($2^m$) after $n$ iterations, where $n$ is the number of gates in the netlist. The final expression $F_n = Sig_{in}$ can be used to verify if the circuit performs the desired arithmetic function by checking if the computed polynomial $Sig_{in}$ matches the expected specification, if known. This equivalence check can be readily performed using canonical word-level representations, such as BMD [15] or TED [17], which can efficiently check equivalence of two polynomials. Alternatively, if the specification is not known, the computed signature can serve as the specification extracted from the circuit.

[Figure 3 image: netlist with inputs $a_0$, $a_1$, $b_0$, $b_1$, internal signals $i_1$ to $i_6$, gates 1 to 8, and outputs $z_0$, $z_1$.]

Figure 3: The gate-level netlist of a post-synthesized and mapped 2-bit multiplier over GF($2^2$). The irreducible polynomial is $P(x) = x^2 + x + 1$.

[Figure 4 table: expressions $F_8$ to $F_1$ obtained at gates G8 to G1, starting from $Sig_{out}$: $F_{init} = z_0 + x z_1$ and ending in $Sig_{in} = a_0 b_0 + a_1 b_1 + x(a_1 b_1 + a_1 b_0 + a_0 b_1)$, with a second column listing the eliminated terms ($2x$ at G4 and G2, $2$ at G1).]

Figure 4: Function extraction of the 2-bit GF multiplier shown in Figure 3 using backward rewriting from PO to PI.

Example 2 (Figure 3): We illustrate our method using a post-synthesized 2-bit multiplier in GF($2^2$), shown in Figure 3. The irreducible polynomial is $P(x) = x^2 + x + 1$. The output signature is $Sig_{out} = z_0 + z_1 x$, and the input signature is $Sig_{in} = (a_0 b_0 + a_1 b_1) + (a_1 b_1 + a_1 b_0 + a_0 b_1) x$. First, $F_{init} = Sig_{out}$ is transformed into $F_8$ using the polynomial of gate $g_8$, which expresses $z_1$ as the sum of two internal signals; after simplification, $F_8$ consists of $z_0$ plus $x$ times each of those two signals. Then, the polynomials $F_i$ are successively derived from $F_{i+1}$ and checked for a possible reduction. The first reduction happens when $F_5$ is transformed into $F_4$, where the internal signal at gate $g_4$ is replaced by its expression in the primary inputs. After simplification, a monomial $2x$ is identified and removed by modulo 2 from $F_4$. Similar reductions are applied during the transformations $F_3 \rightarrow F_2$ and $F_2 \rightarrow F_1$. Finally, the function of the design is extracted as expression $F_1$. A complete rewriting process is shown in Figure 4. We can see that $F_1 = Sig_{in}$, which indicates that the circuit indeed implements the GF($2^2$) multiplication with $P(x) = x^2 + x + 1$.

An important observation is that the potential reductions take place only within the expression associated with the same degree of the polynomial ring ($Sig_{out}$). In other words, the reductions happen in the logic cone of every output bit independently of other bits, regardless of logic sharing between the cones. For example, the reductions in $F_4$ and $F_2$ happen within the logic cone of output $z_1$ only. Similarly, in $F_1$, the reduction is within the logic cone of $z_0$. Details of the proof are provided in [13].

C. Implementation
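A minimal sketch of the flow detailed in this section: backward rewriting (in the spirit of Algorithm 1) applied to each output bit separately, with the per-bit extractions dispatched to a bounded thread pool. This is an illustration under stated assumptions, not the authors' C++ tool. It works directly with GF(2) coefficients by storing a polynomial as a set of monomials, so a monomial with an even coefficient simply disappears; the netlist is a hypothetical 2-bit GF($2^2$) multiplier (not the netlist of Figure 3), and all function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

ONE = frozenset()  # the constant monomial 1 (no variables)


def p_add(f, g):
    return f ^ g  # coefficients modulo 2: equal monomials cancel


def p_mul(f, g):
    out = set()
    for m1 in f:
        for m2 in g:
            out ^= {m1 | m2}  # v*v = v for Boolean v; toggling is mod-2 addition
    return frozenset(out)


def var(name):
    return frozenset({frozenset({name})})


def gate_poly(kind, ins):
    """Algebraic model of a gate, following Eq. (1)."""
    a = var(ins[0])
    if kind == "INV":
        return p_add(frozenset({ONE}), a)
    b = var(ins[1])
    if kind == "AND":
        return p_mul(a, b)
    if kind == "XOR":
        return p_add(a, b)
    if kind == "OR":
        return p_add(p_add(a, b), p_mul(a, b))
    raise ValueError(kind)


def substitute(f, v, expr):
    """Replace variable v in polynomial f by expr, simplifying modulo 2."""
    out = frozenset()
    for mono in f:
        if v in mono:
            out = p_add(out, p_mul(frozenset({mono - {v}}), expr))
        else:
            out = p_add(out, frozenset({mono}))
    return out


def extract(netlist, out_bit):
    """Rewrite the signature z_i from outputs to inputs (reverse topological order)."""
    f = var(out_bit)
    for out, kind, ins in reversed(netlist):
        f = substitute(f, out, gate_poly(kind, ins))
    return f


def extract_all(netlist, out_bits, threads=2):
    # One task per output bit, at most `threads` at a time. Python threads only
    # illustrate the scheduling; they do not give a real CPU speedup.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return dict(zip(out_bits, pool.map(lambda z: extract(netlist, z), out_bits)))


# Hypothetical netlist in topological order: (output, gate type, inputs).
# The cone of z1 shares z0 and then re-adds n1, so n1 cancels modulo 2.
NETLIST = [
    ("n1", "AND", ("a0", "b0")), ("n2", "AND", ("a1", "b1")),
    ("n3", "AND", ("a0", "b1")), ("n4", "AND", ("a1", "b0")),
    ("z0", "XOR", ("n1", "n2")), ("n5", "XOR", ("n3", "n4")),
    ("n6", "XOR", ("n5", "z0")), ("z1", "XOR", ("n6", "n1")),
]

result = extract_all(NETLIST, ["z0", "z1"])
for z, poly in result.items():
    print(z, "=", " + ".join("".join(sorted(m)) for m in sorted(poly, key=sorted)))
```

Because each output bit is rewritten from its own signature, the tasks share no intermediate expressions, which is what allows them to run independently.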

This section describes the implementation of our parallelveriﬁcation method for Galois ﬁeld multipliers. Our approachtakes the gate-level netlist as input, and outputs the extractedfunction of the design. It includes four steps:

Step 1: Convert netlist to equations. Parse the gate-level netlist into algebraic equations based on Equation 1. The equations are listed in topological order, to be rewritten by backward rewriting in the next step. For a GF($2^m$) multiplier, $m$ copies of this equation file are made.

Step 2: Generate signatures. Split the output signature of the GF($2^m$) multiplier into $m$ polynomials, with $Sig_{out_i} = z_i$. Insert the new signatures into the $m$ copies of the equation file generated in Step 1. Each signature represents a single output bit.

Step 3: Parallel extraction. Apply Algorithm 1 to each equation file to extract the polynomial expression of each output in parallel. In contrast to work on integer arithmetic [2], the internal expression of each output bit does not offer any polynomial reduction (monomial cancellation) with other bits. Ideally, our approach extracts a GF($2^m$) multiplier in $m$ threads. However, computing resources are limited, so this is not possible when $m$ is very large. Hence, our approach limits the number of parallel threads to $T$ ($T$ = 5, 10, 20, and 30 have been tested in this work). This process is illustrated in Figure 5. The $m$ extraction tasks are organized into several task sets, ordered from LSB to MSB. In each set, the extractions are performed in parallel. Since the runtime of each extraction within a set can differ, a task from the next set starts as soon as any previous task terminates.

Step 4: Finalization. Compute the final function of the multiplier. Once the algebraic expression of each output bit in GF($2^m$) is computed, our method computes the final function by constructing $Sig_{out}$ from the results of the rewriting process in Step 3.

Our algorithm uses a data structure that efficiently implements iterative substitution and elimination during backward rewriting. It is similar to the data structure employed in function extraction for integer arithmetic circuits [2], suitably modified to support simplifications in finite field algebra. Specifically, in addition to cancellation of terms with opposite signs, it performs modulo 2 reduction of monomials and constants. The data structure maintains a record of the terms (monomials) in the expression that contain the variable to be substituted, which reduces the cost of finding the terms whose coefficients change during substitution. Each element represents one monomial, consisting of the variables in the monomial and its coefficient. The expression data structure is a C++ object that represents a pseudo-Boolean expression, which contains all the elements in the data structure. It supports both fast addition and fast substitution with two C++ maps, implemented as binary search trees: a terms map and a substitution map. This data structure includes two cases of simplification: 1) after substitution, the coefficients of all the monomials are updated and the monomials with coefficient zero are eliminated; 2) the monomials whose coefficients evaluate to 0 modulo 2 are eliminated. The second case is applied after each substitution.

Figure 5: Step 3: parallel extraction of a GF($2^m$) multiplier with $T$ threads. (Each of the $m$ copies of the netlist equations is paired with one signature, from $Sig_{out} = z_0$ to $Sig_{out} = z_{m-1}$; the tasks are grouped into the sets $z_0 \ldots z_{T-1}$, $z_T \ldots z_{2T-1}$, and so on up to $z_{m-1}$.)

(Figure 6 tabulates the backward rewriting of the two signatures, $Sig_{out} = z_0$ and $Sig_{out} = x \cdot z_1$, through gates G8 to G1, with an "elim" column marking the eliminations. The final expressions are $z_0 = a_0 b_0 + a_1 b_1$ and $x \cdot z_1 = x(a_0 b_1 + a_1 b_0 + a_1 b_1)$.)

Figure 6:

Extracting the algebraic expression of $z_0$ and $z_1$ in Fig. 4.

Example 3 (Figure 6): We illustrate our parallel extraction method using the 2-bit multiplier in GF($2^2$) of Figure 3. The output signature $Sig_{out} = z_0 + z_1 x$ is split into two signatures, $Sig_{out_0} = z_0$ and $Sig_{out_1} = z_1$. Then the rewriting process is applied to $Sig_{out_0}$ and $Sig_{out_1}$ in parallel. When both have been successfully extracted, the two signatures are merged into $Sig_{out_0} + x \cdot Sig_{out_1}$, resulting in the polynomial $Sig_{in}$. In Figure 4, we can see that elimination happens three times and that, as expected, each elimination happens within a single element of GF($2^n$). In Figure 6, one elimination in one signature and two eliminations in the other are performed independently, as shown earlier (refer to Example 2).

V. REVERSE ENGINEERING
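Section V relies on the per-output expressions produced by the parallel extraction of Section IV-C, so a toy version of that extraction is sketched here. This is a minimal Python illustration under stated assumptions, not the authors' C++ tool: the gate list `GATES` is a hypothetical 2-bit netlist for $P(x) = x^2 + x + 1$ (not the circuit of Figure 3), all function names are invented for the example, and a polynomial over GF(2) is modeled as a set of monomials so that symmetric difference performs the modulo 2 cancellation. The thread pool only illustrates how at most $T$ extractions run at once; Python threads do not reproduce the reported speedups.

```python
from concurrent.futures import ThreadPoolExecutor

# A polynomial over GF(2) is a set of monomials; a monomial is a frozenset of
# variable names (x*x = x for Boolean signals). frozenset() is the constant 1.

def var(name):
    return {frozenset([name])}

def add(p, q):
    # Coefficients are reduced mod 2, so equal monomials cancel.
    return p ^ q

def mul(p, q):
    out = set()
    for m1 in p:
        for m2 in q:
            out ^= {m1 | m2}
    return out

# Hypothetical gate-level netlist in topological order: (output, op, in1, in2).
GATES = [
    ("p00", "and", "a0", "b0"),
    ("p01", "and", "a0", "b1"),
    ("p10", "and", "a1", "b0"),
    ("p11", "and", "a1", "b1"),
    ("z0", "xor", "p00", "p11"),
    ("t", "xor", "p01", "p10"),
    ("z1", "xor", "t", "p11"),
]

def gate_poly(op, x, y):
    # Algebraic gate models over GF(2): AND -> x*y, XOR -> x + y.
    if op == "and":
        return mul(var(x), var(y))
    if op == "xor":
        return add(var(x), var(y))
    raise ValueError(f"unsupported gate: {op}")

def substitute(expr, name, repl):
    # Replace variable `name` by polynomial `repl` in every monomial containing it.
    out = set()
    for mono in expr:
        if name in mono:
            out ^= mul({mono - {name}}, repl)
        else:
            out ^= {mono}
    return out

def extract(output):
    # Backward rewriting: start from Sig_out_i = z_i and walk the gates in reverse.
    sig = var(output)
    for out, op, x, y in reversed(GATES):
        sig = substitute(sig, out, gate_poly(op, x, y))
    return sig

def extract_all(outputs, threads):
    # At most `threads` extractions run at once; a new one starts when a worker is free.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return dict(zip(outputs, pool.map(extract, outputs)))

def show(poly):
    return " + ".join(sorted("*".join(sorted(m)) or "1" for m in poly))

result = extract_all(["z0", "z1"], threads=2)
for name in sorted(result):
    print(name, "=", show(result[name]))
```

Each call to `extract` touches only its own signature, which is why the tasks need no synchronization beyond collecting the results; the per-output expressions in `result` are the input to the analysis that follows.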

In this section, we present our approach to reverse engineering of GF($2^m$) multipliers. Using the extraction technique presented in the previous section, we can extract the algebraic expression of each output bit. In contrast to the algebraic techniques of [6], [10], our extraction technique can extract the algebraic expression of each output bit independently. This means that the extraction can be done without knowledge of the bit positions of the inputs and outputs. Two theorems are provided and proved to support this claim.

In a GF($2^m$) multiplication, let $s_i$ ($i \in \{0, 1, ..., 2m-2\}$) be a set of partial products generated by AND gates and combined with XOR operations. For example, in Figure 1 there are seven product sets, $s_0, s_1, ..., s_6$, where $s_1 = a_0 b_1 + a_1 b_0$, or written as a set, $s_1 = \{a_0 b_1, a_1 b_0\}$, etc. These product sets are divided into two groups: those with index $i \le m-1$, called in-field product sets, and those with index $i \ge m$, called out-of-field product sets. The in-field product sets $s_i$, in this case $s_0, s_1, s_2, s_3$, correspond to the output bits $z_i$. The out-of-field product sets are reduced into the field GF($2^m$) using the mod $P(x)$ operation and assigned to the respective output bits, as determined by $P(x)$. In the case of Figure 1, the out-of-field sets are $s_4, s_5, s_6$. In general, for a GF($2^m$) multiplication, $m$ product sets are in-field and $m-1$ product sets are out-of-field [38].

A. Output Encoding Determination

We will now demonstrate how to determine the encoding, and hence the bit position, of the outputs.

Theorem 2: Given a GF($2^m$) multiplication, the in-field product sets ($s_0, s_1, ..., s_{m-1}$) each appear in exactly one element of GF($2^m$), and the out-of-field product sets ($s_m, s_{m+1}, ..., s_{2m-2}$) appear in at least two elements (outputs) of GF($2^m$), as a result of the reduction mod $P(x)$.

Proof: An irreducible polynomial in GF($2^m$) has the standard form $P(x) = x^m + P'(x)$, where the tail polynomial $P'(x)$ contains at least two monomials $x^d$ with degree $d < m$. For example, there are two such monomials for a trinomial, four for a pentanomial, etc. Since $P(x) = 0$, we have $x^m = P'(x)$ in GF($2^m$). Hence the variable $x^m$, associated with the first out-of-field partial product set $s_m$, will appear in at least two outputs, determined by $P'(x)$. Other variables $x^k$, associated with the out-of-field partial product sets $s_k$ for $k > m$, can be expressed as $x^k = x^{k-m} x^m = x^{k-m} P'(x)$ and will contain at least two elements. QED

In fact, the number of outputs in which the out-of-field set $s_k$ appears is equal to the number of monomials in the above product $x^{k-m} P'(x)$, provided that every monomial $x^j$ with $j \ge m$ is recursively reduced mod $P(x)$, i.e., by using the relation $x^m = P'(x)$. We illustrate this fact with an example of multiplication in GF($2^4$) using the irreducible polynomial $P(x) = x^4 + x^3 + 1$, shown on the left side of Figure 1. The in-field sets, associated with outputs $z_0, z_1, z_2, z_3$, are $s_0, s_1, s_2, s_3$. Since $P(x) = x^4 + x^3 + 1 = 0$, we obtain $x^4 = x^3 + 1$. This means that set $s_4$ appears in two output columns, $z_3$ and $z_0$. Then $x^5 = x \cdot x^4 = x(x^3 + 1) = x^4 + x = x^3 + x + 1$, which means that $s_5$ appears in three outputs: $z_3$, $z_1$, $z_0$. Finally, $x^6 = x \cdot x^5 = x(x^3 + x + 1) = x^4 + x^2 + x = x^3 + x^2 + x + 1$, that is, $s_6$ appears in four outputs: $z_3, z_2, z_1, z_0$. As expected, this matches the left table in Figure 1. Note the recursive derivation of $x^k$ for $k > m$, which increases the number of columns to which a given set $s_k$ is assigned.

Based on Theorem 2, we can find the in-field product sets, $s_0, s_1, ..., s_{m-1}$, by searching for the unique products in the resulting algebraic expressions of the output bits.
In this context, unique products are the products that exist in only one of the extracted algebraic expressions. Since the in-field product set indicates the bit position of the output, we can determine the bit positions of the output bits as soon as all the in-field product sets are identified.

Example 4 (Figure 2): We illustrate the procedure of determining bit positions with an example of a GF($2^4$) multiplier implemented using an irreducible polynomial $P(x)$ shown in Figure 1. Note that in this process the labels of the variables do not offer any knowledge of the bit positions of the inputs and outputs; they do not indicate any binary encoding information. The extracted algebraic expressions of the four output bits are shown in Figure 2. We first identify the unique products, which include set $s_0 = a_0 b_0$ in the algebraic expression of $z_0$; set $s_1 = (a_0 b_1 + a_1 b_0)$ in $z_1$; set $s_2 = (a_0 b_2 + a_1 b_1 + a_2 b_0)$ in $z_2$; and set $s_3 = (a_0 b_3 + a_1 b_2 + a_2 b_1 + a_3 b_0)$ in $z_3$. Note that the number of products in the in-field product set $s_i$ is $i+1$. Hence, we find all the in-field product sets and their relation to the extracted algebraic expressions to be as follows:

$s_0 = a_0 b_0$, $z_0$ → least significant bit (LSB)
$s_1 = a_0 b_1 + a_1 b_0$, $z_1$ → 2nd output bit
$s_2 = a_0 b_2 + a_1 b_1 + a_2 b_0$, $z_2$ → 3rd output bit
$s_3 = a_0 b_3 + a_1 b_2 + a_2 b_1 + a_3 b_0$, $z_3$ → most significant bit (MSB)
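The unique-product search of Example 4 can be sketched as follows. This is an illustrative Python model under stated assumptions, not the authors' code: the function names are hypothetical, the output expressions are generated by a small reference model for $P(x) = x^4 + x^3 + 1$ (the polynomial used in the illustration of Theorem 2) instead of being extracted from a netlist, each product $a_i b_j$ is represented by the pair `(i, j)`, and the output labels `y0` to `y3` are deliberately scrambled so that the labels carry no position information.

```python
def reduce_power(k, m, tail):
    """Exponents (< m) of x^k mod P(x), where P(x) = x^m + sum of x^d for d in tail."""
    exps = {k}
    while max(exps) >= m:
        top = max(exps)
        exps ^= {top}                          # remove x^top ...
        exps ^= {top - m + d for d in tail}    # ... and add x^(top-m) * P'(x), mod 2
    return exps

def multiplier_outputs(m, tail):
    """Products (i, j), meaning a_i * b_j, contained in each output bit z_0..z_{m-1}."""
    outs = [set() for _ in range(m)]
    for k in range(2 * m - 1):
        s_k = {(i, k - i) for i in range(m) if 0 <= k - i < m}
        for d in reduce_power(k, m, tail):
            outs[d] ^= s_k
    return outs

def output_positions(exprs):
    """Map each output label to its bit position using unique products (Theorem 2)."""
    positions = {}
    for label, prods in exprs.items():
        others = set().union(*(p for l, p in exprs.items() if l != label))
        unique = prods - others                # products seen in this output only
        positions[label] = len(unique) - 1     # the in-field set s_i has i + 1 products
    return positions

z = multiplier_outputs(4, {3, 0})              # assumed P(x) = x^4 + x^3 + 1
scrambled = {"y2": z[0], "y0": z[1], "y3": z[2], "y1": z[3]}
print(output_positions(scrambled))
```

For this model the search reports `y2` as the LSB and `y1` as the MSB, because only the sizes of the unique product sets are used and never the labels themselves.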

B. Input Encoding Determination

Algorithm 2: Input encoding determination for GF($2^m$)
Input: a set of algebraic expressions $S$ that represent the in-field product sets
Output: bit positions of the input variables
1: $S = \{s_0, s_1, ..., s_{m-1}\}$
2: initialize a vector of variables $V \leftarrow \{\}$
3: for $i = 0$; $i \le m-1$; $i$++ do
4:   for each variable $v$ in the algebraic expression of $s_i$ do
5:     if $v$ does not exist in $V$ then
6:       assign bit position value of $v = i$
7:       store $v$ in variable set $V$
8:     end if
9:   end for
10: end for
11: return $V$

We can now determine the bit positions of the input variables using the procedure outlined in Algorithm 2. The input bit positions can be determined by analyzing the in-field product sets obtained in the previous step. Based on the GF multiplication algorithm, we know that $s_0$ is generated by an AND function of the two LSBs of the two inputs, and that the two products in $s_1$ are generated by AND and XOR operations using the two LSBs and the two 2nd input bits, etc. For example, in a GF($2^4$) multiplication (Figure 1), $s_0 = a_0 b_0$, where $a_0$ and $b_0$ are LSBs; $s_1 = a_1 b_0 + a_0 b_1$, where $a_0$, $b_0$ are LSBs and $a_1$, $b_1$ are 2nd LSBs. This allows us to determine the bit positions of the input bits recursively by analyzing the algebraic expressions of $s_i$. We illustrate this with the GF($2^4$) multiplier of Example 4 (Figure 2).

Example 5 (Algorithm 2):

The input of our algorithm is a set of algebraic expressions of the in-field product sets, $s_0, s_1, s_2, s_3$ (line 1). We initialize vector $V$ to store the variables whose bit positions have been assigned (line 2). The first algebraic expression is $s_0$. Since its two variables, $a_0$ and $b_0$, are not in $V$, the bit positions of these two variables are assigned index $i = 0$ (lines 4-8). In the second iteration, $V = \{a_0, b_0\}$, and the input algebraic expression is $s_1$, which includes variables $a_0$, $b_0$, $a_1$, and $b_1$. Because $a_1$ and $b_1$ are not in $V$, their bit position is $i = 1$. The loop ends when all the algebraic expressions in $S$ have been visited, and returns $V = \{(a_0, b_0), (a_1, b_1), (a_2, b_2), (a_3, b_3)\}$. The subscripts are the bit position values of the variables returned by the algorithm. Note that this procedure only gives the bit positions of the input bits; how the input words are constructed remains unknown. There are $2^{m-1}$ combinations from which the words can be constructed using the information returned in $V$. For example, the two input words can be $W_0 = a_0 + 2a_1 + 4b_2 + 8a_3$ and $W_1 = b_0 + 2b_1 + 4a_2 + 8b_3$; or they can be $W'_0 = a_0 + 2a_1 + 4b_2 + 8b_3$ and $W'_1 = b_0 + 2b_1 + 4a_2 + 8a_3$. Although there may be many combinations for constructing the input words, the specification of the GF($2^m$) multiplication is unique.

C. Extraction of the Irreducible Polynomial
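The polynomial extraction in this subsection starts from the vector $V$ returned by Algorithm 2, so that step is sketched here first. The code below is a minimal Python illustration, not the authors' implementation: the names `input_positions` and `word_pairs` are hypothetical, the in-field product sets of the GF($2^4$) example are written out by hand, and variable names are treated as opaque labels.

```python
from itertools import product

def input_positions(in_field_sets):
    """Algorithm 2: the first set s_i in which a variable occurs gives its bit position."""
    V = {}
    for i, s_i in enumerate(in_field_sets):
        for prod in sorted(s_i):
            for v in prod:
                if v not in V:
                    V[v] = i
    return V

def word_pairs(V):
    """Enumerate the ways to split the variables into two input words."""
    m = max(V.values()) + 1
    by_pos = [sorted(v for v in V if V[v] == i) for i in range(m)]
    for choice in product((0, 1), repeat=m - 1):
        picks = (0,) + choice  # fix position 0 so that swapped words are not counted twice
        w0 = tuple(by_pos[i][c] for i, c in enumerate(picks))
        w1 = tuple(by_pos[i][1 - c] for i, c in enumerate(picks))
        yield w0, w1

# In-field product sets s_0..s_3 of the GF(2^4) example; each product is a pair of labels.
S = [
    {("a0", "b0")},
    {("a0", "b1"), ("a1", "b0")},
    {("a0", "b2"), ("a1", "b1"), ("a2", "b0")},
    {("a0", "b3"), ("a1", "b2"), ("a2", "b1"), ("a3", "b0")},
]
V = input_positions(S)
pairs = list(word_pairs(V))
print(V)
print(len(pairs))
```

For $m = 4$ the enumeration yields $2^{m-1} = 8$ candidate word pairs, among them the pairs ($W_0$, $W_1$) and ($W'_0$, $W'_1$) of Example 5, which is why the first out-of-field set cannot be written down directly.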

Theorem 3: Given a multiplication in GF($2^m$), let the first out-of-field product set be $s_m$. Then the irreducible polynomial $P(x)$ includes the monomials $x^m$ and $\{x^i\}$ iff all products in the set $s_m$ appear in the algebraic expression of the $i$-th output bit, for all $i < m$.

Proof: Based on the definition of field arithmetic for GF($2^m$), the polynomial basis representation of $s_m$ is $x^m s_m$. To reduce $s_m$ into elements in the range $[0, m-1]$, the field reductions are performed modulo the irreducible polynomial $P(x)$ with highest degree $m$ (c.f. the proof of Theorem 2). As before, let $P(x) = x^m + P'(x)$. Then

$x^m s_m \bmod (x^m + P'(x)) = s_m P'(x)$.

Hence, if $x^i$ exists in $P'(x)$, it also exists in $P(x)$. Therefore, $x^i$ exists in $P(x)$ iff $x^i s_m$ exists in $x^m s_m \bmod P(x)$.

Even though the input bit positions have been determined in the previous step, we cannot directly generate $s_m$, since the combination of the input bits used to construct the input words is still unknown. In Example 5 ($m = 4$), we can see that $s_m = \{a_1 b_3, a_2 b_2, a_3 b_1\}$ when the input words are $W_0$ and $W_1$, but $s_m = \{a_1 a_3, a_2 b_2, b_1 b_3\}$ when the input words are $W'_0$ and $W'_1$. To overcome this limitation, we create a set of products $s'_m$, which includes all the possible products that can be generated based on all input combinations. The set $s'_m$ includes the true products, i.e., those that exist in the first out-of-field product set, and it also includes some dummy products. The dummy products are those that never appear in the resulting algebraic expressions. Hence, we first generate the set $s'_m$ and eliminate the dummy products by searching the algebraic expressions. After this, we obtain $s_m$. Then we use $s_m$ to extract the irreducible polynomial $P(x)$ using Algorithm 3.

Example 6:

We illustrate the method of reverse engineering the irreducible polynomial using the GF($2^4$) multiplier of Fig. 1. The procedure is outlined in Algorithm 3. The extracted algebraic expressions $S$ (line 1 of Algorithm 3) are shown in Figure 2. The bit positions of the input bits are determined by Algorithm 2 (line 2). Based on the result of Algorithm 2, we generate $s'_m = \{a_1 a_3, b_1 b_3, a_1 b_3, a_2 b_2, a_3 b_1\}$. To eliminate the dummy products from $s'_m$, we search all algebraic expressions in $S$ and eliminate the products that do not appear in any of them. In this case, we find that $a_1 a_3$ and $b_1 b_3$ are the dummy products. Hence, we get $s_m = \{a_1 b_3, a_2 b_2, a_3 b_1\}$ (line 3). Based on the definition of an irreducible polynomial, $P(x)$ must include $x^m$; in this example $m = 4$ (line 4). While looping over all the algebraic expressions, we find that the expressions of exactly two output bits contain all the products of $s_m$. Hence, the two corresponding monomials $x^i$ are included in $P(x)$ together with $x^4$, which yields a trinomial. We can see that it is the same as $P(x)$ in Figure 1.

Algorithm 3

Extracting irreducible polynomial in GF($2^m$)
Input: the algebraic expressions of output bits $S$
Input: the first out-of-field product set $s_m$
Output: irreducible polynomial $P(x)$
1: $S = \{exp_0, exp_1, ..., exp_{m-1}\}$
2: $V \leftarrow$ Algorithm 2($S$)
3: $s_m \leftarrow$ eliminate_dummy($s'_m \leftarrow V$, $S$)
4: $P(x) = x^m$: initialize irreducible polynomial
5: for $i = 0$; $i \le m-1$; $i$++ do
6:   if all products in $s_m$ exist in $exp_i$ then
7:     $P(x)$ += $x^i$
8:   end if
9: end for
10: return $P(x)$

In summary, using the framework presented in Section IV-C, we first extract the algebraic expressions of all output bits. Then we analyze the algebraic expressions to find the bit positions of the input bits and the output bits, and extract the irreducible polynomial $P(x)$. For the GF($2^4$) multiplier example shown in Figure 1, our approach returns the following: 1) the input bit set $V = \{(a_0, b_0), (a_1, b_1), (a_2, b_2), (a_3, b_3)\}$, where the subscripts represent the bit positions; 2) $z_0$ is the least significant bit (LSB), $z_1$ is the 2nd output bit, $z_2$ is the 3rd output bit, and $z_3$ is the most significant bit (MSB); 3) the irreducible polynomial $P(x)$ used in the design; 4) the specification, which can be verified using the approach presented in Section IV with the reverse engineered information.

VI. RESULTS
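The reverse-engineering measurements in this section rely on the polynomial extraction of Algorithm 3, which can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' C++ code: `build_outputs` is a hypothetical reference model that stands in for the extracted expressions, `extract_polynomial` is an invented name, products are unordered pairs of variable labels, and the two tail sets `{3, 0}` and `{1, 0}` correspond to $x^4 + x^3 + 1$ and $x^4 + x + 1$, the two irreducible trinomials of degree 4.

```python
def build_outputs(m, tail):
    """Reference model: products in z_0..z_{m-1} for P(x) = x^m + sum of x^d, d in tail."""
    cols = [{k} for k in range(m)]              # cols[k]: exponents of x^k mod P(x)
    for k in range(m, 2 * m - 1):
        cur = set()
        for e in cols[k - 1]:                   # x^k = x * x^(k-1)
            cur ^= {e + 1} if e + 1 < m else set(tail)
        cols.append(cur)
    outs = [set() for _ in range(m)]
    for k, col in enumerate(cols):
        s_k = {frozenset((f"a{i}", f"b{k - i}")) for i in range(m) if 0 <= k - i < m}
        for d in col:
            outs[d] ^= s_k
    return outs

def extract_polynomial(exprs, V):
    """Algorithm 3: return the exponents of P(x), highest first."""
    m = len(exprs)
    names = sorted(V)
    # s'_m: every product of two distinct variables whose bit positions sum to m.
    candidates = {frozenset((u, v)) for u in names for v in names
                  if u < v and V[u] + V[v] == m}
    # Dummy products never occur in any extracted expression, so drop them.
    s_m = candidates & set().union(*exprs)
    return [m] + [i for i in range(m - 1, -1, -1) if s_m <= exprs[i]]

m = 4
V = {f"{w}{i}": i for w in "ab" for i in range(m)}   # bit positions, as Algorithm 2 returns them
p1 = extract_polynomial(build_outputs(m, {3, 0}), V)
p2 = extract_polynomial(build_outputs(m, {1, 0}), V)
print(p1, p2)
```

The subset test `s_m <= exprs[i]` is the direct counterpart of line 6 of Algorithm 3; it works here because the product sets $s_k$ are disjoint, so no product of $s_m$ can cancel against another set in the same output.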

The experimental results of our method are presented in two subsections: 1) evaluation of parallel verification of GF($2^m$) multipliers; and 2) evaluation of reverse engineering of GF($2^m$) multipliers. The results given in this section include data (total time and maximum memory) for the entire verification or reverse engineering process, including translating the gate-level Verilog netlist to algebraic equations, performing backward rewriting, and other required functions.

A. Parallel Verification of GF($2^m$) Multipliers

The verification technique for GF($2^m$) multipliers presented in Section IV was implemented in C++. It performs backward rewriting with variable substitution and polynomial reductions in Galois field in parallel fashion. The program was tested on a number of combinational gate-level GF($2^m$) multipliers taken from [6], including the Montgomery multipliers [39] and Mastrovito multipliers [40]. The bit-width of the multipliers varies from 32 to 571 bits. The verification results for various Galois field multipliers obtained using SAT, SMT, ABC [41], and Singular [42] have already been presented in [1] and [6]. They clearly demonstrate that techniques based on computer algebra perform significantly better than other known techniques. Hence, in this work, we only compare our approach to those two, and specifically to the tool described in [6]. However, in contrast to the previous work on Galois field verification, all the GF($2^m$) multipliers used in this paper are bit-blasted gate-level implementations. The bit-level multipliers are taken from [6] and mapped onto gate-level circuits using ABC [41]. Our experiments were conducted on a PC with an Intel(R) Xeon CPU E5-2420 v2 2.20 GHz × 12 with 32 GB memory. As described in Section IV, our technique can verify Galois field multipliers in multiple threads by applying Algorithm 1 to each output bit in parallel. The number of threads is given as input to the tool.

| Op size | # eqns | [5] Runtime (s) | [5] Mem | T=1 (s) | T=5 (s) | T=10 (s) | T=20 (s) | T=30 (s) | Mem (T=1*) |
|---|---|---|---|---|---|---|---|---|---|
| 32 | 5,482 | 1 | 3 | 5 | 2 | 1 | 1 | 1 | 10 MB |
| 48 | 12,228 | 8 | 13 | 9 | 6 | 3 | 3 | 2 | 21 MB |
| 64 | 21,814 | 29 | 21 | 19 | 11 | 8 | 7 | 7 | 37 MB |
| 96 | 51,412 | 195 | 45 | 68 | 38 | 26 | 20 | 23 | 84 MB |
| 128 | 93,996 | 924 | 91 | 153 | 91 | 63 | 55 | 57 | 152 MB |
| 163 | 153,245 | 3546 | 161 | 336 | 192 | 137 | 121 | 113 | 248 MB |
| 233 | 167,803 | 4933 | 168 | 499 | 294 | 212 | 180 | 171 | 270 MB |
| 283 | 399,688 | 30358 | 380 | 1580 | 890 | 606 | 550 | 530 | 642 MB |
| 571 | 1,628,170 | TO | - | 13176 | 7980 | 5038 | MO | MO | |

TABLE I: Results of verifying Mastrovito multipliers using our parallel approach. T is the number of threads. MO = Memory out of 32 GB. TO = Time out of 18 hours. (*T=1 shows the maximum memory usage of a single thread.)

| Op size | # eqns | [5] Runtime (s) | [5] Mem | T=1 (s) | T=5 (s) | T=10 (s) | T=20 (s) | T=30 (s) | Mem (T=1*) |
|---|---|---|---|---|---|---|---|---|---|
| 32 | 4,352 | 2 | 3 | 5 | 3 | 2 | 1 | 2 | 8 MB |
| 48 | 9,602 | 14 | 13 | 34 | 18 | 11 | 9 | 6 | 16 MB |
| 64 | 16,898 | 63 | 21 | 80 | 45 | 31 | 28 | 27 | 27 MB |
| 96 | 37,634 | 554 | 45 | 414 | 234 | 157 | 133 | 142 | 59 MB |
| 128 | 66,562 | 1924 | 68 | 335 | 209 | 121 | 115 | 110 | 95 MB |
| 163 | 107,582 | 12063 | 101 | 2505 | 1616 | 1172 | 1095 | 1008 | 161 MB |
| 233 | 219,022 | TO | 168 | 1240 | 722 | 565 | 457 | 480 | 301 MB |
| 283 | 322,622 | TO | 380 | 32180 | 19745 | 17640 | 15300 | 14820 | 488 MB |

TABLE II: Results of verifying Montgomery multipliers using our parallel approach. T is the number of threads. TO = Time out of 18 hours. (*T=1 shows the maximum memory usage of a single thread.)

The experimental results of our approach and the comparison with [6] are shown in Table I for gate-level Mastrovito multipliers with bit-width varying from 32 to 571 bits. These multipliers are directly mapped using ABC without any optimization. The largest circuit includes over 1.6 million gates. This is also the number of polynomial equations and the number of rewriting iterations (see Section IV). The results generated by the tool presented in [6] are shown in columns 3 and 4 of Table I. We performed four different series of experiments, with the number of threads $T$ varying from 5 to 30. The table shows CPU runtime and memory usage for different values of $T$. The timeout limit (TO) was set to 12 hours and the memory limit (MO) to 32 GB. The experimental results show that our approach provides on average 26.2×, 37.8×, 42.7×, and 44.3× speedup for $T$ = 5, 10, 20, and 30 threads, respectively. Our approach can verify multipliers up to 571 bits wide in 1.5 hours, while that of [6] fails after 12 hours.

The reported memory usage of our approach is the maximum memory usage per thread. This means that our tool experiences maximum memory usage when all $T$ threads are running in the process; in this case, the memory usage is $T \cdot Mem$. This is why the 571-bit Mastrovito multiplier could be successfully verified with $T$ = 5 and 10, but failed with $T$ = 20 and 30 threads. For example, the peak memory usage of the 571-bit Mastrovito multiplier with $T$ = 20 is 2.6 × 20 = 52 GB, which exceeds the available memory limit.

We also tested Montgomery multipliers with bit-width varying from 32 to 283 bits; the results are shown in Table II. These experiments are different from those in [6]. In our work, we first flatten the Montgomery multipliers before applying our verification technique. That is, we assume that only the positions of the primary inputs and outputs are known, without knowledge of any high-level structure. In contrast, [6] verifies Montgomery multipliers that are represented with four hierarchical blocks. For 32- to 163-bit Montgomery multipliers, our approach provides on average a 9.2×, 15.9×, 16.6×, and 17.4× speedup for $T$ = 5, 10, 20, and 30, respectively. Notice that [6] cannot verify the flattened Montgomery multipliers of 233 bits and larger in 12 hours.

Comparing Tables I and II, we observe that the rewriting technique of our approach, when applied to Montgomery multipliers, requires significantly more time than for Mastrovito multipliers. The main reason for this difference is the internal architecture of the two multiplier types. Mastrovito multipliers are obtained directly from the standard multiplication structure, with the partial product generator followed by an XOR-tree structure, as in modular arithmetic. Since the algebraic model of XOR in GF arithmetic is linear, the size of the polynomial expressions generated during rewriting of this architecture is relatively small. In contrast, in a Montgomery multiplier the two inputs are first transformed into Montgomery form; the products of these Montgomery forms are called Montgomery products. Since the polynomial expressions in Montgomery form are much larger than partial products, the increase in size of intermediate expressions is unavoidable.

Dependence on P(x): In Table II, we observe that the CPU runtime for verifying a 163-bit multiplier is greater than that of a 233-bit multiplier. This is because the computational complexity depends not only on the bit-width of the multiplier, but also on the irreducible polynomial $P(x)$ used in constructing the multiplier.

We illustrate this fact using two GF($2^4$) multiplications implemented using two different irreducible polynomials (c.f. Figure 1). We can see that for $P_1(x) = x^4 + x^3 + 1$, the longest logic paths, for $z_3$ and $z_0$, include ten and seven products that need to be combined using XORs, respectively. However, when $P_2(x) = x^4 + x + 1$, the two longest paths, $z_1$ and $z_2$, have only seven and six products. This means that the GF($2^4$) multiplication requires 9 XOR operations using $P_1(x)$ and 6 XOR operations using $P_2(x)$. In other words, the gate-level implementation of the multiplier using $P_1(x)$ has more gates than the one using $P_2(x)$. In conclusion, the irreducible polynomial $P(x)$ has a significant impact on both the design cost and the verification time of GF($2^m$) multipliers.

(Figure 7 plots four series: Montgomery runtime, Mastrovito runtime, Montgomery memory, and Mastrovito memory.)

Figure 7: Runtime and memory usage of the parallel verification approach as a function of the number of threads $T$.

Runtime vs. Memory: In this section, we discuss the tradeoff between runtime and memory usage of our parallel approach to Galois field multiplier verification. The plots in Figure 7 show the average runtime and memory usage for different numbers of threads, over the set of multipliers shown in Tables I and II (32 to 283 bits). The vertical axis on the left is CPU runtime (in seconds), and the one on the right is memory usage (MB). The horizontal axis represents the number of threads $T$, ranging from 1 to 30. The runtime is significantly improved for $T$ ranging from 5 to 15. However, there is not much speedup when $T$ is greater than 20, most likely due to the memory management synchronization overhead between the threads. Similarly to the results for Mastrovito multipliers (Table I), our approach is limited here by the memory usage when the size of the multiplier and the number of threads $T$ are large. In our work, $T$ = 20 seems to be the best choice. Obviously, the best $T$ varies across platforms, depending on the number of cores and the memory.

We also analyzed the runtime complexity of our verification algorithm for a single-thread ($T$ = 1) computation; it is shown in Figure 8. The y-axis shows the total runtime of rewriting the polynomial expressions, and the x-axis indicates the size of the Mastrovito multiplier. The result shows that the overall speedup is roughly the same for each value of $T$. Montgomery multipliers exhibit similar behavior, regardless of the choice of the irreducible polynomial.

Figure 8: Single thread runtime analysis for Mastrovito multipliers. (Axes: runtime in seconds versus size of the Mastrovito multiplier; series: T=1, T=5, T=10.)

Effect of Synthesis on Verification: In [10] the authors conclude that highly bit-optimized integer arithmetic circuits are harder to verify than their original, pre-synthesized netlists. This is because the efficiency of the rewriting technique relies on the amount of cancellations between the different terms of the polynomial, and such cancellations strongly depend on the order in which signals are rewritten. A good ordering of signals is difficult to achieve in highly bit-optimized synthesized circuits.

To see the effect of synthesis on parallel verification of GF circuits, we applied our approach to post-synthesized Galois field multipliers with operands up to 409 bits (571-bit multipliers could not be synthesized in a reasonable time). We synthesized Mastrovito and Montgomery multipliers using the ABC tool [41]. We repeatedly used the commands resyn2 and dch until the number of AIG levels or nodes could not be reduced anymore ("dch" is the most efficient bit-optimization function in ABC). The synthesized multipliers were mapped using a 14nm technology library. The verification experiments shown in Table III were performed by our tool with $T$ = 20 threads. Our tool was able to verify both the 409-bit Mastrovito and Montgomery multipliers within just 13 minutes. We observed that in our parallel approach Galois field multipliers are easier to verify after optimization than in their original form. For example, the verification of a 283-bit Montgomery multiplier takes 15,300 seconds for $T$ = 20. After optimization, the runtime dropped to just 169.2 seconds, which means that such a verification is 90x faster than that of the original implementation. The memory usage has also been reduced from 488 MB to 194 MB. In summary, in contrast to verification problems of integer multipliers [10], the bit-level optimization actually reduces the complexity of the backward rewriting process. This is because extracting the function of an output bit of a GF multiplier depends only on the logic cone of that bit and does not require logic expressions from other bits to be simplified (c.f. Theorem 3). Hence, the complexity of function extraction is naturally reduced if the logic cone is minimized.

Op size | Mastrovito | Montgomery

TABLE III: Runtime and memory usage of synthesized Mastrovito and Montgomery multipliers ($T$ = 20).

B. Reverse Engineering of GF($2^m$) Multipliers

The reverse engineering technique presented in this paper was implemented in C++ in the framework described in Section V. It reverse engineers bit-blasted GF($2^m$) multipliers by analyzing the algebraic expressions of each element using the approach presented in Section IV. The program was tested on a number of gate-level GF($2^m$) multipliers with different irreducible polynomials, including Montgomery multipliers and Mastrovito multipliers. The multiplier generator, taken from [1], takes the bit-width and the irreducible polynomial as inputs and generates the multipliers in the equation format. The experimental results show that our technique can successfully reverse engineer various GF($2^m$) multipliers, regardless of the GF($2^m$) algorithm and the irreducible polynomial.

We set the number of threads to 16 for all the reverse engineering evaluations in this section. This is dictated by the fact that $T$ = 16 gives the most promising performance (runtime) and scalability (memory usage) metrics on our platform, based on the analysis presented in Section VI-A2 (Figure 7).

| m | P(x) | Mastrovito-syn T(s) | Mastrovito-syn Mem | Montgomery-syn T(s) | Montgomery-syn Mem |
|---|---|---|---|---|---|
| 64 | $x^{64} + \ldots + 1$ (pentanomial) | 13 | 25 MB | 5 | 20 MB |
| 163 | $x^{163} + \ldots + 1$ (pentanomial) | 69 | 508 MB | 221 | 610 MB |
| 233 | $x^{233} + \ldots + 1$ (trinomial) | 152 | 1.2 GB | 154 | 2.9 GB |
| 409 | $x^{409} + \ldots + 1$ (trinomial) | 825 | 6.5 GB | 855 | 10.3 GB |

TABLE IV: Results of reverse engineering synthesized and technology mapped Mastrovito and Montgomery multipliers.

Our program takes the netlist/equations of the GF($2^m$) implementations and the number of threads as input. Hence, users can adjust the parallel effort depending on the limitations of their machines. In this work, all results are obtained with 16 threads. Typical designs that require reverse engineering are those that have been bit-optimized and mapped using a standard-cell library. Hence, we apply our technique to the bit-optimized Mastrovito and Montgomery multipliers (Table IV). For the purpose of our experiments, the multipliers are optimized and mapped using ABC [41]. Compared to the verification runtime of synthesized multipliers (Table III), the CPU time spent on analyzing the extracted expressions for reverse engineering is less than 10% of the extraction process. This is because most computations of the reverse engineering approach are associated with extracting the algebraic expressions, as presented in Section VI-A2, Table III.

Figure 9: Result of reverse engineering GF($2^{233}$) Mastrovito multipliers implemented with different P(x). (Vertical axis: runtime in seconds; the four designs use $x^{233}+x^{201}+x^{105}+x^{9}+1$, $x^{233}+x^{185}+x^{121}+x^{105}+1$, $x^{233}+x^{74}+1$, and $x^{233}+x^{159}+1$.)

The reverse engineering approach has been further evaluated using four Mastrovito multipliers, each implemented with a different irreducible polynomial $P(x)$ in GF($2^{233}$). The polynomials are obtained from [43], and the multipliers are optimized using the ABC synthesis tool. The results are shown in Figure 9. We can see that the multipliers implemented with a trinomial $P(x)$ are much easier to reverse engineer than those based on a pentanomial $P(x)$. This is because the multipliers implemented with a pentanomial $P(x)$ contain more gates and have a longer critical path, since the reduction over a pentanomial requires more XOR operations. The CPU runtime for irreducible polynomials of the same class (trinomials or pentanomials) is almost the same. As discussed in Section III-B, comparison of the two trinomials shows that the efficient trinomial irreducible polynomial, $x^m + x^a + 1$, typically satisfies $m - a > m/2$.

The results for designs synthesized with a 14nm technology library are shown in Figure 10. It shows that the area and delay of the Mastrovito multiplier implemented with $P(x) = x^{233}+x^{74}+1$ are 5.7% and 7.4% less than for $P(x) = x^{233}+x^{159}+1$, respectively.

Figure 10: Evaluation of the design cost using GF($2^{233}$) Mastrovito multipliers with irreducible polynomials $x^{233}+x^{74}+1$ and $x^{233}+x^{159}+1$.

VII. CONCLUSION

This paper presents a parallel approach to verification and reverse engineering of gate-level Galois field multipliers using a computer algebraic approach. It introduces a parallel rewriting method that can efficiently extract the functional specification of Galois field multipliers as polynomial expressions. We demonstrate that, compared to the best known algorithms, our approach tested on $T$ = 30 threads provides on average 44× and 17× speedup in verification of Mastrovito and Montgomery multipliers, respectively. We presented a novel approach that reverse engineers gate-level Galois field multipliers in which the irreducible polynomial, as well as the bit positions of the inputs and outputs, are unknown. We demonstrated that our approach can efficiently reverse engineer Galois field multipliers implemented using different irreducible polynomials. Future work will focus on formal verification of prime field arithmetic circuits and complex cryptography circuits.

ACKNOWLEDGMENT

The authors would like to thank Prof. Kalla, University of Utah, for his valuable comments and the benchmarks; and Dr. Arnaud Tisserand, University Rennes 1 ENSSAT, for his valuable discussion. This work has been funded by NSF grants CCF-1319496 and CCF-1617708.

REFERENCES

[1] J. Lv, P. Kalla, and F. Enescu, “Efficient Gröbner Basis Reductions for Formal Verification of Galois Field Arithmetic Circuits,” IEEE Trans. on CAD, vol. 32, no. 9, pp. 1409–1420, September 2013.
[2] M. Ciesielski, C. Yu, W. Brown, D. Liu, and A. Rossi, “Verification of Gate-level Arithmetic Circuits by Function Extraction,” ACM, 2015, pp. 52–57.
[3] C. Paar and J. Pelzl, Understanding cryptography: a textbook for students and practitioners. Springer Science & Business Media, 2009.
[4] M. Ciet, J.-J. Quisquater, and F. Sica, “A short note on irreducible trinomials in binary fields,” 2002.
[5] T. Pruss, P. Kalla, and F. Enescu, “Equivalence Verification of Large Galois Field Arithmetic Circuits using Word-Level Abstraction via Gröbner Bases,” in DAC’14, 2014, pp. 1–6.
[6] T. Pruss, P. Kalla, and F. Enescu, “Efficient symbolic computation for word-level abstraction from combinational circuits for verification over finite fields,” IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 35, no. 7, pp. 1206–1218, 2016.
[7] E. Pavlenko, M. Wedler, D. Stoffel, W. Kunz, A. Dreyer, F. Seelisch, and G. Greuel, “Stable: A new qf-bv smt solver for hard verification problems combining boolean reasoning with computer algebra,” in DATE, 2011, pp. 155–160.
[8] F. Farahmandi and B. Alizadeh, “Groebner basis based formal verification of large arithmetic circuits using gaussian elimination and cone-based polynomial extraction,” Microprocessors and Microsystems, vol. 39, no. 2, pp. 83–96, 2015.
[9] A. Sayed-Ahmed, D. Große, U. Kühne, M. Soeken, and R. Drechsler, “Formal verification of integer multipliers by combining grobner basis with logic reduction,” in DATE’16, 2016, pp. 1–6.
[10] C. Yu, W. Brown, D. Liu, A. Rossi, and M. J. Ciesielski, “Formal verification of arithmetic circuits using function extraction,” IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 35, no. 12, pp. 2131–2142, 2016.
[11] C. Yu and M. J. Ciesielski, “Automatic word-level abstraction of datapath,” in IEEE International Symposium on Circuits and Systems, ISCAS 2016, Montréal, QC, Canada, 2016, pp. 1718–1721.
[12] A. Sayed-Ahmed, D. Große, M. Soeken, and R. Drechsler, “Equivalence checking using grobner bases,”

FMCAD’2016 , 2016.[13] C. Yu and M. J. Ciesielski, “Efﬁcient parallel veriﬁcation of galois ﬁeldmultipliers,”

ASP-DAC’17 , 2017.[14] R. E. Bryant, “Graph-based algorithms for boolean function manipula-tion,”

IEEE Trans. on Computers , vol. 100, no. 8, pp. 677–691, 1986.[15] R. E. Bryant and Y. Chen, “Veriﬁcation of arithmetic circuits with binarymoment diagrams,” in

Proceedings of the 32st Conference on DesignAutomation, San Francisco, California, USA, Moscone Center, June 12-16, 1995. , 1995, pp. 535–541. [16] Y.-A. Chen and R. Bryant, “*PHDD: An Efﬁcient Graph Representationfor Floating Point Circuit Veriﬁcation,” School of Computer Science,Carnegie Mellon University, Tech. Rep. CMU-CS-97-134, 1997.[17] M. Ciesielski, P. Kalla, and S. Askar, “Taylor Expansion Diagrams: ACanonical Representation for Veriﬁcation of Data Flow Designs,”

IEEETrans. on Computers , vol. 55, no. 9, pp. 1188–1201, Sept. 2006.[18] R. Kaivola, R. Ghughal, N. Narasimhan, A. Telfer, J. Whittemore,S. Pandav, A. Slobodov´a, C. Taylor, E. R. V. Frolov, and A. Naik., “Re-placing Testing with Formal Veriﬁcation in Intel CoreTM i7 ProcessorExecution Engine Validation,” in

Computer Aided Veriﬁcation (CAV) .Springer, 2009, pp. 414–429.[19] A. Mishchenko et al. , “ABC: A System for Sequential Synthesis andVeriﬁcation,” , 2007.[20] N. Sorensson and N. Een, “Minisat v1. 13-a sat solver with conﬂict-clause minimization,”

SAT , vol. 2005, p. 53, 2005.[21] M. Soos, “Enhanced Gaussian Elimination in DPLL-based SATSolvers.” in

POS@ SAT , 2010, pp. 2–14.[22] M. Davis, G. Logemann, and D. Loveland, “A machine program fortheorem-proving,”

Communications of the ACM , vol. 5, no. 7, pp. 394–397, 1962.[23] C.-Y. Huang and K.-T. Cheng, “Using Word-level ATPG and ModularArithmetic Constraint-Solving Techniques for Assertion Property Check-ing,”

IEEE Trans. on CAD , vol. 20, no. 3, pp. 381–391, March 2001.[24] F. Fallah, S. Devadas, and K. Keutzer, “Functional vector generationfor hdl models using linear programming and 3-satisﬁability,” in

DesignAutomation Conference (DAC) . IEEE, 1998, pp. 528–533.[25] R. Brinkmann and R. Drechsler, “RTL-datapath Veriﬁcation using Inte-ger Linear Programming,” in

Proceedings of the 2002 Asia and SouthPaciﬁc Design Automation Conference (ASP-DAC) . IEEE ComputerSociety, 2002, p. 741.[26] Z. Zeng, K. R. Talupuru, and M. Ciesielski, “Functional Test GenerationBased on Word-level SAT,”

Journal of Systems Architecture , vol. 51,no. 8, pp. 488–511, 2005.[27] A. Biere, M. Heule, and H. van Maaren,

Handbook of satisﬁability . iospress, 2009, vol. 185.[28] A. Niemetz, M. Preiner, and A. Biere, “Boolector 2.0,”

Journal onSatisﬁability, Boolean Modeling and Computation , vol. 9, 2015.[29] L. De Moura and N. Bjørner, “Z3: An efﬁcient smt solver,” in

Tools andAlgorithms for the Construction and Analysis of Systems . Springer,2008, pp. 337–340.[30] C. Barrett, C. L. Conway, M. Deters, L. Hadarean, D. Jovanovi´c, T. King,A. Reynolds, and C. Tinelli, “CVC4,” in

Computer aided veriﬁcation(CAV) . Springer, 2011, pp. 171–177.[31] M. J. Gordon and T. F. Melham, “Introduction to HOL A TheoremProving Environment for Higher Order Logic,” in

Cambridge UniversityPress , 1993.[32] S. Owre, J. M. Rushby, and N. Shankar, “PVS: A Prototype VeriﬁcationSystem,” in

Automated Deduction - CADE-11 . Springer, 1992, pp. 748–752.[33] B. Brock, M. Kaufmann, and J. S. Moore, “Acl2 theorems aboutcommercial microprocessors,” in

Formal Methods in Computer-AidedDesign (FMCAD) . Springer, 1996, pp. 275–293.[34] S. Vasudevan, V. Viswanath, R. W. Sumners, and J. A. Abraham,“Automatic Veriﬁcation of Arithmetic Circuits in RTL using StepwiseReﬁnement of Term Rewriting Systems,”

IEEE Trans. on Computers ,vol. 56, no. 10, pp. 1401–1414, 2007.[35] D. Kapur and M. Subramaniam, “Mechanical Veriﬁcation of AdderCircuits using Rewrite Rule Laboratory,”

Formal Methods in SystemDesign (FMCAD) , vol. 13, no. 2, pp. 127–158, 1998.[36] D. Stoffel and W. Kunz, “Equivalence Checking of Arithmetic Circuitson the Arithmetic Bit Level,”

IEEE Trans. on CAD , vol. 23, no. 5, pp.586–597, May 2004.[37] NIST, “Recommended elliptic curves for federal government use,” 1999.[38] C. Yu, D. Holcomb, and M. Ciesielski, “Reverse engineering of irre-ducible polynomials in gf (2 m) arithmetic,” in . IEEE, 2017, pp.1558–1563.[39] C. K. Koc and T. Acar, “Montgomery multiplication in gf (2k),”

Designs,Codes and Cryptography , vol. 14, no. 1, pp. 57–69, 1998.[40] B. Sunar and C¸ . K. Koc¸, “Mastrovito multiplier for all trinomials,”

Computers, IEEE Transactions on , vol. 48, no. 5, pp. 522–527, 1999.[41] A. Mishchenko et al. , “Abc: A system for sequential synthesis andveriﬁcation,” , 2007.[42] W. Decker, G.-M. Greuel, G. Pﬁster, and H. Sch¨onemann, “S

INGULAR

EEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS 13 [43] M. Scott, “Optimal irreducible polynomials for gf (2m) arithmetic.”