Area-Efficient VLSI Implementation of Serial-In Parallel-Out Multiplier Using Polynomial Representation in Finite Field GF(2^m)
Saeideh Nabipour *, Gholamreza Zare Fatin, Javad Javidan
Department of Computer and Electrical Engineering, University of Mohaghegh Ardabili, Ardabil, Iran
* Corresponding author: [email protected]
Abstract
Finite field multipliers are mainly used in elliptic curve cryptography, error-correcting codes and signal processing. The multiplier is regarded as the bottleneck arithmetic unit in such applications, and multiplication is the most complicated operation over the finite field GF(2^m), requiring a large amount of logic resources. In this paper, a new modified serial-in parallel-out multiplication algorithm with interleaved modular reduction is proposed. The proposed method yields an area-efficient architecture compared with algorithms previously proposed in the literature. The reduction in multiplier complexity is achieved by using logic NAND gates in a particular architecture. The efficiency of the proposed architecture is evaluated in terms of time complexity (latency, critical path) and space complexity (gate and latch counts). A detailed comparative analysis indicates that the proposed NAND-based finite field multiplier requires fewer transistors than most previously reported designs.
Keywords: Finite field multiplication, polynomial basis, bit-serial multiplier, hardware architecture, irreducible polynomial.
Introduction
Finite field arithmetic has recently gained considerable attention in many important areas such as coding theory, the implementation of error-correcting codes (ECC), cryptography, computer algebra, and digital signal processing. A finite field GF(2^m) is an algebraic structure containing 2^m elements, on which arithmetic operations such as addition, multiplication, inversion, and squaring can be performed. Among them, multiplication is the most demanding and time-consuming operation and is used repeatedly in exponentiation, division, and multiplicative inversion. Addition is simpler than the other finite field operations because there is no carry propagation, so it can be carried out with two-input XOR gates. Hence, multiplication plays a pivotal role in finite field arithmetic, and an efficient multiplier design is crucial to meet the performance requirements of state-of-the-art applications based on finite field arithmetic. In the literature, the elements of a finite field GF(2^m) are represented using various bases, including the dual basis (DB) [10], [19], polynomial basis (PB) [1], [2], [5], [7], normal basis (NB) [3], [6], [19] and redundant basis (RB) [17], [28], each of which has its own distinct features [9]. The efficiency of finite field multiplication depends strongly on the representation of the elements. While DB multipliers require less chip area for VLSI implementation, they usually need additional modules for basis conversion [12]. The PB multiplier is probably the best known and does not require basis conversion. Moreover, owing to its regularity and simplicity, it has gained more attention for hardware implementation. Compared with the DB and PB, the NB multiplier is favored for squaring in a finite field, which makes it more suitable for division, multiplicative inversion, and exponentiation.
The RB not only offers a free squaring operation but also eliminates the modular reduction. However, it requires more bits to represent field elements, because GF(2^m) is embedded in a cyclotomic field of higher order, which can result in higher hardware complexity [28]. An extension Galois field can be completely constructed from a monic polynomial of degree m over GF(2), called an irreducible polynomial of the field GF(2^m). All elements of the extension field GF(2^m) in polynomial basis can be represented as polynomials over GF(2) of degree less than m [24]. The irreducible polynomial plays a central role in arithmetic operations. A polynomial over GF(p) of degree m is irreducible if it is not divisible by any polynomial over GF(p) of degree less than m and greater than zero [24]. Multipliers can be designed for various classes of irreducible polynomials, including generic polynomials, trinomials, pentanomials, and equally spaced polynomials.
Architectures for PB multiplication fall into two categories: parallel and serial computation. In a parallel implementation, all output bits of the multiplication are generated in a single clock cycle, which leads to high throughput. To achieve low space complexity, however, bit-level serial computation is used. In bit-level serial multiplication schemes, space complexity is reduced at the expense of increasing the computational latency, that is, the number of clock cycles required to generate the m output bits, to m clock cycles. The architecture proposed in this work targets resource-constrained applications and is therefore a bit-level PB multiplier. In the bit-level category, architectures can be bit-serial or bit-parallel. In bit-serial architectures, inputs enter and outputs are generated either in parallel or serially, whereas in bit-parallel architectures all inputs and outputs are parallel. Moreover, different types of implementations, such as sequential [25, 29, 31], parallel [6-9], [14-16], [28], systolic [7-8], [21-22], [33-34] and semi-systolic [1, 30], have been proposed in the literature. Because the output of a sequential structure is available only after m clock cycles for GF(2^m) multiplication, it has a longer execution time but lower hardware complexity. Parallel structures generate the output in a single clock cycle, but they consume considerably more hardware.
The outstanding features of systolic structures are regularity, modularity, concurrency and local interconnections, which make them well suited to VLSI implementation. Systolic structures increase throughput, although their area and latency are usually very large. In digit-level architectures [23], [28], [32], a group of bits called a digit is processed at a time. Digit-level architectures can achieve a better trade-off between area and time complexity and are thus practical for resource-constrained devices such as smartphones. Numerous techniques for polynomial basis multiplication over GF(2^m) have been proposed to reduce area overhead and speed up computation. Heyssam et al. [13] presented a new bit-level serial PB multiplication scheme that generates its output bits in parallel after m clock cycles without requiring any preloading of the inputs. Gebali et al. [12] proposed a novel scalable serial multiplier architecture for PB multiplication over GF(2^m) using the progressive product reduction (PPR) technique. This architecture was obtained by converting the GF(2^m) multiplication into an iterative algorithm using a systematic nonlinear technique that combines affine and nonlinear processing element (PE) scheduling with the assignment of computations to processors. In [5], a new class of pentanomials over the binary field was presented; the standard Karatsuba algorithm was used in the multiplication process, and it was shown that no AND gates are required in the reduction process. In [20], a serial-output bit-serial multiplier structure for general irreducible polynomials was proposed. Whereas earlier structures of this type required a latency of m clock cycles, the serial-output bit-serial multiplier of [20] has a latency of one clock cycle. By connecting the output of that multiplier to the serial input of an LSB-first multiplier, one obtains a hybrid structure that performs two multiplications together. Fan et al. [15] presented a new non-pipelined bit-parallel shifted polynomial basis multiplier for GF(2^m).
The main contribution of their multiplier was that its gate delay is equal to $T_{AND} + nT_{XOR}$ for certain irreducible trinomials. The architecture proposed in [18] supports polynomial basis multiplication based on irreducible polynomials with m ≥ kt + … . In terms of timing performance, that architecture has a latency of m/4. In [4], a versatile polynomial basis multiplier was proposed that uses a row of tri-state buffers and some control signals along with the MSB-first multiplier, in a particular architecture with low power dissipation. In [26], three small classes of irreducible polynomials for low-complexity bit-parallel multipliers were presented, and it was shown that the resulting multiplier has lower complexity than those based on pentanomials. Mathe et al. [25] presented a sequential polynomial basis multiplier for generic irreducible polynomials with a latency of m clock cycles. This architecture takes one operand in parallel and the other operand serially during computation. It is a versatile multiplier in the sense that it is applicable to any irreducible polynomial over GF(2^m). In [29], a modified algorithm with an interleaved modular reduction multiplication approach was proposed. The modification consists of formulating the algorithm with more efficient logical relations, so that the multiplier is built from logical NAND and XNOR gates. Motivated by the modified algorithm in [29], in this paper we propose a new modified algorithm based on the serial interleaved multiplication method [11] available in the literature. For a more area-efficient implementation, the proposed formulation eliminates the logical XOR (XNOR) gates using a well-known logical relation. Because a NAND gate has lower area and time complexity than other gates such as AND or XOR/XNOR, employing this logical relation can lead to hardware efficiency and lower critical path delay [27,28].
Analysis shows that the proposed multiplier occupies less area than the majority of similar multipliers available in the literature and is comparable with the best existing area-efficient multiplier for m = 163. Hence, our multiplier is preferable in situations where space complexity and energy saving are more important than time complexity. The remainder of this paper is organized as follows. In Section 2, we provide notation and preliminaries of finite field multiplication in GF(2^m) using the polynomial basis. In Section 3, we derive formulations for the proposed polynomial basis multiplier structure. The architectural complexity and the performance comparison are discussed in Section 4, followed by the details of VLSI implementations of five practical field size multipliers in Section 5. Concluding remarks are given in Section 6.
NOTATIONS AND PRELIMINARIES
As noted before, this paper deals with finite field elements represented in the polynomial (also called standard or canonical) basis. An extension Galois field can be completely constructed from a monic polynomial of degree m of the form

$$f(x) = x^m + \sum_{j=1}^{m-1} f_j x^j + 1$$

over GF(2), called an irreducible polynomial. A polynomial over GF(2) of degree m is irreducible if it is not divisible by any polynomial over GF(2) of degree less than m and greater than zero [24]. All elements of the extension field GF(2^m) in polynomial basis can be represented as polynomials over GF(2) of degree less than m [24]. Let $\alpha \in GF(2^m)$ be a root of the irreducible polynomial f(x) for GF(2^m). Then the polynomial basis is the set of m elements $\{1, \alpha, \alpha^2, \ldots, \alpha^{m-1}\}$, and $f(\alpha) = 0$. Standard basis multiplication in GF(2^m) is typically performed in two steps: ordinary polynomial multiplication and reduction modulo the field polynomial. Let $A(x) = a_0 + a_1x + \cdots + a_{m-1}x^{m-1}$ and $B(x) = b_0 + b_1x + \cdots + b_{m-1}x^{m-1}$ be two field elements and $C(x) = c_0 + c_1x + \cdots + c_{m-1}x^{m-1}$ their product modulo f(x), where all $a_j, b_j, c_j \in GF(2)$. Then the finite field multiplication $C(x) = A(x)B(x)$ is accomplished by calculating the following equation:

$$\begin{aligned} C(x) &= A(x)B(x) \bmod f(x) \\ &= \big((a_0 + a_1x + \cdots + a_{m-1}x^{m-1})(b_0 + b_1x + \cdots + b_{m-1}x^{m-1})\big) \bmod f(x) \\ &= \sum_{j=0}^{m-1} b_j x^j A(x) \bmod f(x) \end{aligned} \tag{1}$$

The product $A(x)B(x)$ needs to be calculated first, resulting in a polynomial of degree at most 2m − 2. In a second stage the modular reduction is performed, resulting in the polynomial C(x) of degree at most m − 1.
PROPOSED LOW-COMPLEXITY BIT-PARALLEL MULTIPLIERS IN GF(2^m)
Each term $b_j x^j A(x)$ in equation (1) can be computed recursively as follows:

$$C(x) = \Big(\big(\cdots\big((b_{m-1}A(x)\,x + b_{m-2}A(x))\,x + b_{m-3}A(x)\big)x + \cdots + b_1A(x)\big)x + b_0A(x)\Big) \bmod f(x) \tag{2}$$

It is obvious from equation (2) that the summation is performed in m iterations for i = 0, 1, …, m−1. It should also be noted that xA(x) is calculated by the modular reduction step and is then multiplied by $b_i$. So, by interleaving f(x) throughout equation (2), it can be expressed as follows:

$$C(x) = \big(\cdots\big(\big(b_{m-1}A(x)\,x \bmod f(x) + b_{m-2}A(x)\big)x \bmod f(x) + b_{m-3}A(x)\big)x \bmod f(x) + \cdots + b_1A(x)\big)x \bmod f(x) + b_0A(x) \tag{3}$$

According to equation (3), the term $b_{m-1}A(x)$ is multiplied by x and reduced (m−1) times. If it is considered as $P^{(k-1)}(x)$, modular reduction is performed on $P^{(k-1)}(x)\,x$, where $P^{(k-1)}(x)$ is the partial-product polynomial generated in the (k−1)th iteration. It can be expressed as a polynomial of degree (m−1) as follows:

$$P^{(k-1)}(x) = p^{(k-1)}_{m-1}x^{m-1} + p^{(k-1)}_{m-2}x^{m-2} + \cdots + p^{(k-1)}_{1}x + p^{(k-1)}_{0} \tag{4}$$

Equation (4) can be evaluated for each $1 \le k \le m$, where k is the iteration count, as follows:

for k = 1: $P^{(0)}(x) = 0$
for k = 2: $P^{(1)}(x) = b_{m-1}A(x)$

Then, to evaluate $P^{(1)}(x)\,x \bmod f(x)$, the product polynomial $P^{(1)}(x)\,x$ is computed first and is then reduced using f(x) as follows:

$$\begin{aligned} P^{(1)}(x)\,x \bmod f(x) &= \big(p^{(1)}_{m-1}x^{m} + p^{(1)}_{m-2}x^{m-1} + \cdots + p^{(1)}_{1}x^{2} + p^{(1)}_{0}x\big) \bmod \big(x^m + f_{m-1}x^{m-1} + \cdots + f_1x + f_0\big) \\ &= \big(p^{(1)}_{m-1}f_{m-1} + p^{(1)}_{m-2}\big)x^{m-1} + \big(p^{(1)}_{m-1}f_{m-2} + p^{(1)}_{m-3}\big)x^{m-2} + \cdots + \big(p^{(1)}_{m-1}f_{1} + p^{(1)}_{0}\big)x + p^{(1)}_{m-1} \end{aligned} \tag{5}$$

From equation (5) it can be seen that the modular reduction reduces to the summation of $p^{(k-1)}_{m-1} f$ and $P \cdot x^i$. The term $P \cdot x^i$ is computed by left shifting P by i positions, for all $1 \le i \le m$, which can be expressed as follows:

$$P^{(k)}(x) = p^{(k-1)}_{m-1} f + P^{(k-1)} \cdot x^{i} \tag{6}$$

Owing to the left-most term of equation (2), the product term $b_{m-2}A(x)$ should then be added to equation (3). The result of the summation is represented by $P^{(2)}(x)$ for the next iteration. So it is obvious that the equation $P^{(k)}(x) = P^{(k-1)}(x) + b_{m-k}A(x)$ should be repeated m times to obtain the final multiplication result of equation (2). It is worth pointing out that the summation is computed using the XOR operation. So equation (6) can be written as follows:

$$P^{(k)}(x) = P^{(k-1)}(x) \oplus b_{m-k}A(x) \tag{7}$$

For a more efficient implementation, the logical XOR operation can be converted to an equivalent circuit that uses only NAND gates, since for any two logic variables a and b we have (a XOR b) = ((a NAND (a NAND b)) NAND ((a NAND b) NAND b)), as explained below:

$$\begin{aligned} P^{(k)}(x) &= P^{(k-1)}(x) \oplus \big(b_{m-k} \cdot A(x)\big) \\ &= \overline{P^{(k-1)}(x)} \cdot \big(b_{m-k} \cdot A(x)\big) + P^{(k-1)}(x) \cdot \overline{\big(b_{m-k} \cdot A(x)\big)} \\ &= \Big(P^{(k-1)}(x) \uparrow \big[P^{(k-1)}(x) \uparrow (b_{m-k} \cdot A(x))\big]\Big) \uparrow \Big(\big[P^{(k-1)}(x) \uparrow (b_{m-k} \cdot A(x))\big] \uparrow (b_{m-k} \cdot A(x))\Big) \end{aligned} \tag{8}$$

where $\uparrow$ denotes the logical NAND operation. This equation shows that the logical NAND operation can be used instead of the logical XOR operation to compute the desired multiplication result. Compared with the conventional equation, this method can significantly reduce hardware complexity. Fig. 1 describes the details of the proposed formula as a flowchart.
Fig. 1: The flowchart of proposed multiplication with new formulations
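The recursion in equations (2) to (7) can be sketched in software as follows. This is a minimal illustration, not the authors' implementation: field elements are assumed to be stored as Python integers whose bit j holds the coefficient of x^j, `f` is assumed to include the x^m term, a single left shift per iteration is assumed for the shift term, and the function names and the example field GF(2^4) with f(x) = x^4 + x + 1 are hypothetical choices made for the example.

```python
def multiply_then_reduce(a: int, b: int, f: int, m: int) -> int:
    """Two-step product of equation (1): multiply first, then reduce modulo f(x)."""
    product = 0
    for j in range(m):                       # sum of b_j * x^j * A(x)
        if (b >> j) & 1:
            product ^= a << j                # addition in GF(2) is XOR
    for degree in range(2 * m - 2, m - 1, -1):
        if (product >> degree) & 1:          # cancel x^degree with x^(degree-m) * f(x)
            product ^= f << (degree - m)
    return product


def interleaved_multiply(a: int, b: int, f: int, m: int) -> int:
    """MSB-first multiplication with interleaved reduction, equations (2) to (7)."""
    mask = (1 << m) - 1
    p = 0                                    # P^(0)(x) = 0
    for k in range(1, m + 1):
        msb = (p >> (m - 1)) & 1             # coefficient p_{m-1} of P^(k-1)
        p = (p << 1) & mask                  # P^(k-1) * x with the x^m term dropped
        if msb:
            p ^= f & mask                    # x^m is congruent to the low part of f(x)
        if (b >> (m - k)) & 1:
            p ^= a                           # add b_{m-k} * A(x), equation (7)
    return p


if __name__ == "__main__":
    F16 = 0b10011                            # x^4 + x + 1, an assumed example field
    print(interleaved_multiply(0b0010, 0b1000, F16, 4))   # x * x^3 = x^4 = x + 1
```

Both functions should agree for every pair of operands; the interleaved version never holds more than m bits of state, which is the property the sequential architecture relies on.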
Proposed bit-serial PB Multiplier Design
The architecture of the proposed sequential PB multiplier in Fig. 2 consists of two main blocks, G and H, two m-bit registers, and a shift-left (SL) block. Assume that the field polynomials to be multiplied are A and B and that the irreducible field polynomial is f. All three polynomials A, B, and f are given in their vector representations, so the operands are treated as m-bit vectors in the multiplier architecture. Block G computes the modular reduction of equation (3), and block H performs the summation of partial products in equation (3). The left-side m-bit register (Reg1) is initially loaded with operand A, and the other m-bit register (Reg2) is initialized to zero; it stores the sum of the previous reduction result and the m-bit partial product, $P^{(k)}$, during the kth cycle. At the first clock cycle, the most significant bit of operand B ($b_{m-1}$) enters block H to be multiplied by operand A; the remaining bits of B then enter serially so that the partial products can be accumulated. The SL block computes the term $P^{(k-1)} x^i$, which is $P^{(k-1)}$ shifted left i times; it consists only of re-wiring and requires no hardware. The final multiplication result is available in Reg2 after m clock cycles. Fig. 3 shows the hardware details of the proposed design. Both blocks are composed of four levels of logic gates, built from arrays of m NAND gates that implement the NAND operations suggested in equation (8).
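The data flow described above can be modelled at the clock-cycle level as in the sketch below. It is an illustrative behavioural model under stated assumptions, not the paper's hardware description: the class name, the method names and the integer encoding (bit j is the coefficient of x^j, `f` includes the x^m term) are hypothetical, the SL block is modelled as a single left shift per cycle, and blocks G and H are modelled with XOR for brevity rather than with NAND gates.

```python
from dataclasses import dataclass


@dataclass
class SerialInParallelOutMultiplier:
    m: int          # field degree
    f: int          # irreducible polynomial, including the x^m term
    reg1: int = 0   # Reg1: holds operand A
    reg2: int = 0   # Reg2: holds the running partial product P^(k)

    @property
    def _mask(self) -> int:
        return (1 << self.m) - 1

    def load(self, a: int) -> None:
        """Load A into Reg1 and clear Reg2, as done before the first clock cycle."""
        self.reg1 = a & self._mask
        self.reg2 = 0

    def _block_g(self) -> int:
        """SL block followed by block G: shift Reg2 left and reduce modulo f(x)."""
        msb = (self.reg2 >> (self.m - 1)) & 1
        shifted = (self.reg2 << 1) & self._mask      # wiring only in hardware
        return shifted ^ ((self.f & self._mask) if msb else 0)

    def _block_h(self, reduced: int, b_bit: int) -> int:
        """Block H: add the partial product b_bit * A to the reduced value."""
        return reduced ^ (self.reg1 if b_bit else 0)

    def clock(self, b_bit: int) -> int:
        """One clock cycle: consume one serial bit of B and update Reg2."""
        self.reg2 = self._block_h(self._block_g(), b_bit)
        return self.reg2

    def multiply(self, a: int, b: int):
        """Feed B serially, MSB first; the product is in Reg2 after m cycles."""
        self.load(a)
        trace = [self.clock((b >> i) & 1) for i in range(self.m - 1, -1, -1)]
        return self.reg2, trace
```

For example, with the assumed field GF(2^4) and f(x) = x^4 + x + 1, `SerialInParallelOutMultiplier(4, 0b10011).multiply(0b0110, 0b0111)` returns the product together with the content of Reg2 after each of the four clock cycles.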
Fig. 2: Proposed structure for bit-serial sequential multiplier
Fig. 3: Logic diagram of modules G and H
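The NAND-only form of equation (8) can be checked with a small gate-level sketch of blocks G and H. The code below is an assumption-laden illustration rather than the logic of Fig. 3: every NAND is modelled as a 2-input gate, each gate array is m bits wide, one left shift per cycle is assumed, and the helper names and the `Counter`-based gate tally are hypothetical.

```python
from collections import Counter


def nand_array(x: int, y: int, m: int, gates: Counter) -> int:
    """An array of m two-input NAND gates acting on two m-bit words."""
    gates["NAND"] += m
    return ~(x & y) & ((1 << m) - 1)


def and_array(bit: int, word: int, m: int, gates: Counter) -> int:
    """An array of m two-input AND gates: one control bit ANDed with an m-bit word."""
    gates["AND"] += m
    return word if bit else 0


def xor_from_nands(x: int, y: int, m: int, gates: Counter) -> int:
    """x XOR y built as (x NAND n) NAND (n NAND y) with n = x NAND y, equation (8)."""
    n = nand_array(x, y, m, gates)
    return nand_array(nand_array(x, n, m, gates), nand_array(n, y, m, gates), m, gates)


def one_clock(p: int, a: int, f: int, b_bit: int, m: int):
    """One clock cycle through blocks G and H; returns the new P and the gates used."""
    gates = Counter()
    mask = (1 << m) - 1
    msb = (p >> (m - 1)) & 1
    shifted = (p << 1) & mask                                   # SL block: wiring only
    reduced = xor_from_nands(shifted, and_array(msb, f & mask, m, gates), m, gates)  # block G
    p_next = xor_from_nands(reduced, and_array(b_bit, a, m, gates), m, gates)        # block H
    return p_next, gates


def nand_multiply(a: int, b: int, f: int, m: int):
    """Serial-in parallel-out multiplication using only AND and NAND arrays."""
    p, gates = 0, Counter()
    for k in range(1, m + 1):
        p, gates = one_clock(p, a, f, (b >> (m - k)) & 1, m)
    return p, gates          # gates holds the per-cycle (that is, hardware) gate count
```

Under these assumptions each clock cycle exercises 2m AND gates and 8m NAND gates, which is the gate budget the complexity analysis attributes to the proposed design.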
In this section, we obtain the space and time complexities of the proposed parallel output bit-serial (POBS) PB multiplier. The area and delay complexities of the proposed design can be readily calculated from Fig. 1. Table 1 compares the proposed design with some of its competitors [8], [22], [25], [29], [33], in terms of area (AND gates, NAND gates, XOR gates, multiplexers, and flip-flops), latency, and critical path delay.
Proposition 1: For the finite field GF(2^m) generated by a general irreducible polynomial f(x), the proposed POBS PB multiplier (Fig. 2) requires two m-bit registers (2m flip-flops), 2m AND gates and 8m NAND gates.
Proof: Each block (see Fig. 2) contains three levels of NAND-gate arrays, which together hold 4m NAND gates (four NAND gates per bit, as in equation (8)), and one level of an array of m AND gates. Moreover, the proposed architecture needs two registers to keep the m-bit operand A and the final m-bit product, respectively. Hence, the proposed multiplier architecture requires 8m NAND gates, 2m AND gates and 2m flip-flops. The time complexity of the multiplier is determined by three factors: the latency, the number of clock cycles required for the whole multiplication, and the critical path delay. Let us define the latency as the number of clock cycles needed before the first bit of the output is available. Based on this definition, it is clear that the latency of the proposed POBS multiplier is m clock cycles. The critical path delay, which is the delay of the longest path from the registers to the output C, determines the maximum operating frequency.
Proposition 2:
Let $T_A$, $T_X$, $T_N$, $T_{XN}$, $T_{FF}$, $T_M$ and $T_{tsb}$ denote the delays of a 2-input AND gate, a 2-input XOR gate, a 2-input NAND gate, a 2-input XNOR gate, a D flip-flop, a 2-to-1 multiplexer, and a tri-state buffer, respectively. Then the critical path delay and latency of the proposed POBS PB multiplier (Fig. 1) are at most $2T_A + 4T_N$ and m clock cycles, respectively.
Proof: The critical path delay of each block is determined by the maximum delay from input to output. As can be observed from Fig. 1, each block has three levels, which consist of 2-input NAND gates, 3-input NAND gates and 2-input AND gates. So the delay of each block is $\max(T_A, T_N) + 2T_N$. Since $T_A \ge T_N$, the total delay to generate C through the two blocks is equal to $2T_A + 4T_N$. In addition, the multiplication of two m-bit elements is computed over m iterations. Therefore, the resulting latency is m clock cycles, and the proof is complete.
From the results shown in Table 1, we cannot claim that our proposed multiplier is the best available sequential multiplier, but its results are comparable with those of the best sequential one [29]. Although the overall structure of the two architectures might seem similar, there is one important difference between them: the proposed PB multiplier is implemented without XOR gates. In terms of hardware complexity, the proposed architecture comes second, with the design in [29] requiring about 11.5% fewer transistors for m = 163. It can be observed from Table 1 that the multipliers in [8], [22], [33] need more registers and have longer latency than the proposed multiplier. Moreover, the multiplier in [25] requires more hardware but has a shorter critical path than the proposed architecture.
Table 1. Comparison of space and time complexities between different finite field multipliers
Multiplier | AND | NAND | XOR/XNOR | MUX | Flip-flop | Latency | Critical path
[33] | … | 0 | 2m^2 | … | … | m | T_A + T_X
[22] | … + 2m | 0 | 2m^2 + 3m | 0 | 3m^2 + 4m | m/2 + 1 | T_A + 2T_X
[8] | m^2 | 0 | m^2 + m − 1 | 0 | 3m^2 + 2m − 2 | 2m − 1 | T_A + T_X
[25] | 2m | 0 | 2m | 0 | 3m | m | T_A + T_X
[29] | 0 | 2m | 2m | 0 | 2m | m | T_N + 2T_XN
Proposed | 2m | 8m | 0 | 0 | 2m | m | 2T_A + 4T_N

To enable a better comparison, the area and delay complexities of the multipliers listed in Table 1 have been calculated and tabulated in Table 2 as a case study. The National Institute of Standards and Technology (NIST) has recommended five binary fields for elliptic curve cryptographic applications: GF(2^163), GF(2^233), GF(2^283), GF(2^409), and GF(2^571). In all the calculations made for Table 2, the field size was selected as m = 163 [19]. The CMOS065LP standard cell library from STMicroelectronics is used to estimate the area and time complexities of the gates. The area complexities, in terms of the number of transistors, of a 2-input NAND gate, a 3-input NAND gate, a 2-input AND gate, a 2-input XOR gate, a 2-input XNOR gate, a 2-1 MUX and a D flip-flop with set/reset capabilities are 4, 8, 6, 12, 12, 12, and 30 transistors, respectively. Moreover, the delays of a 2-input NAND gate, a 2-input AND gate, a 2-input XOR gate, a 2-input XNOR gate, a 2-1 MUX and a D flip-flop with set/reset are 0.02, 0.03, 0.04, 0.04, 0.03, and 0.08 ns, respectively. It is observed that the proposed multiplier requires fewer transistors than the majority of multipliers available in the literature. In the design of digit-level finite field multipliers, there is always a trade-off between delay and area as two important design factors, and reducing one of them generally results in an increase in the other. To achieve a fair comparison, the area-delay product (ADP) of the multipliers has been calculated and listed in Table 2. As can be seen, the proposed architecture has a much lower transistor count than all the existing multipliers listed in Table 2 except the multiplier in [29]. The space complexity of the proposed multiplier is 99.96%, 99.91%, 97.97% and 17.46% lower than that of the multipliers in [33], [22], [8], [25], respectively. Furthermore, in comparison with the multipliers in [33], [22], [8], the proposed architecture offers 99.93%, 99.89%, and 96.02% improvement in ADP. As seen from this table, the proposed POBS multiplier achieves its low hardware complexity at the expense of a longer critical path and more delay.

Table 2. Comparison between different finite field multipliers for the NIST recommended field GF(2^163)
Multiplier | Critical path (ns) | Latency (clock cycles) | Delay (ns) | Area (transistors) | ADP (×10^5 transistor·ns) | Reduction in area (%) | Reduction in ADP (%)
[33] | 0.07 | 818 | 11.41 | 48176928 | 5496 | 99.96 | 99.93
[22] | 0.11 | 205 | 17.93 | 21146118 | 3791 | 99.91 | 99.89
[8] | 0.07 | 817 | 11.41 | 837629 | 95.5 | 97.97 | 96.02
[25] | 0.07 | 163 | 11.41 | 20538 | 2.3 | 17.46 | -
[29] | 0.10 | 163 | 16.3 | 14996 | 2.4 | - | -
Proposed | 0.14 | 163 | 22.82 | 16952 | 3.8 | - | -

Table 3 illustrates the comparison of the total transistor count for the five field sizes recommended by NIST (m = 163, 233, 283, 409 and 571). The number of transistors required by the proposed multiplier rises linearly with m, similar to the multipliers in [25] and [29]. On the other hand, the transistor requirements of the architectures in [8], [22], [33] grow quadratically as the degree of the irreducible polynomial increases. For better understanding, Fig. 4 depicts the result of this table. Furthermore, the proposed multiplier can considerably reduce space complexity for higher-order finite fields.

Table 3. Comparison of total transistor count for the five irreducible polynomials recommended by NIST
Multiplier | m = 163 | m = 233 | m = 283 | m = 409 | m = 571
[22] | 1285418 | 2620318 | 3861818 | 8054846 | 15685370
[8] | 906910 | 1850930 | 2729230 | 5696530 | 11097934
[33] | 2391210 | 4886010 | 7208010 | 15055290 | 29343690
[25] | 20538 | 29358 | 35658 | 51534 | 71946
[29] | 14996 | 21436 | 26036 | 37628 | 52532
Proposed | 16952 | 24232 | 29432 | 42536 | 59384
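The transistor counts of the three sequential designs can be recomputed from the per-gate figures quoted above. The sketch below is only a bookkeeping aid under stated assumptions: the per-bit gate counts are taken from the rows for [25], [29] and the proposed design in Table 1, every NAND is costed as a 2-input gate, and the dictionary and function names are hypothetical.

```python
# Transistors per cell and gate delays in ns, as quoted in the text.
TRANSISTORS = {"NAND": 4, "AND": 6, "XOR": 12, "XNOR": 12, "FF": 30}
DELAY_NS = {"NAND": 0.02, "AND": 0.03, "XOR": 0.04, "XNOR": 0.04}

# Gates per bit of m, read from Table 1 (only the designs that scale linearly with m).
GATES_PER_BIT = {
    "proposed": {"AND": 2, "NAND": 8, "FF": 2},
    "[29]": {"NAND": 2, "XNOR": 2, "FF": 2},
    "[25]": {"AND": 2, "XOR": 2, "FF": 3},
}

NIST_FIELD_SIZES = (163, 233, 283, 409, 571)


def transistor_count(design: str, m: int) -> int:
    """Total transistors for an m-bit multiplier of the given design."""
    per_bit = sum(TRANSISTORS[gate] * n for gate, n in GATES_PER_BIT[design].items())
    return per_bit * m


def proposed_critical_path_ns() -> float:
    """Critical path of the proposed design: 2*T_A + 4*T_N."""
    return 2 * DELAY_NS["AND"] + 4 * DELAY_NS["NAND"]


def area_delay_product(design: str, m: int, critical_path_ns: float, latency: int) -> float:
    """Area (transistors) times total delay (critical path times latency in cycles)."""
    return transistor_count(design, m) * critical_path_ns * latency


if __name__ == "__main__":
    for name in GATES_PER_BIT:
        print(name, [transistor_count(name, m) for m in NIST_FIELD_SIZES])
    print(area_delay_product("proposed", 163, proposed_critical_path_ns(), 163))
```

With these figures the proposed design costs 104 transistors per bit, against 92 for [29] and 126 for [25], which reproduces the corresponding rows of Table 3; the area-delay product for m = 163 comes out at roughly 3.9 × 10^5 transistor·ns, close to the 3.8 entry in Table 2.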
Fig. 4: Comparison of space complexity of the proposed multiplier with some selected multipliers
Conclusion
We proposed a low-complexity serial-in parallel-out multiplication scheme over generic field polynomials for the elements of GF(2^m), based on the PB representation. By combining the recursive formula of [11] with the formulation of this study, an area-efficient architecture has been obtained. It shows that serial-in parallel-out PB multipliers can be designed with logical NAND gates for the five binary fields recommended by NIST, which can lead to regular and modular VLSI implementations of finite field multipliers. We note that no XOR gates are required in our implementation. The complexity analysis of the proposed multiplier suggests that its space complexity is comparable to, and in most cases better than, that of previously proposed designs, which is desirable in constrained applications such as smart cards, handhelds, and implantable medical devices.
References
1. A. Ibrahim, F. Gebali, “Low power semi-systolic architectures for polynomial-basis multiplication over GF(2^m) using progressive multiplier reduction,” J. Signal Process. Syst. 82 (3) (2016) 331–343.
2. A. Reyhani-Masoleh, “A New Bit-Serial Architecture for Field Multiplication Using Polynomial Bases,” …th International Workshop, Washington, D.C., USA, August 10-13, 2008.
3. A. Reyhani-Masoleh, “Efficient algorithms and architectures for field multiplication using Gaussian normal bases,” IEEE Trans. Comput., vol. 55, no. 1, pp. 34–47, Jan. 2006.
4. A. Zakerolhosseini, M. Nikooghadam, “Low-power and high-speed design of a versatile bit-serial multiplier in finite fields GF(2^m),” Integration, the VLSI Journal 46 (2013) 211–217. 10.1016/j.vlsi.2012.03.001.
5. B. Gustavo, C. Ricardo, P. Daniel, “A new class of irreducible pentanomials for polynomial-based multipliers in binary fields,” Journal of Cryptographic Engineering (2018). 10.1007/s13389-018-0197-6.
6. C. Koc, B. Sunar, “Low-complexity bit-parallel canonical and normal basis multipliers for a class of finite fields,” IEEE Trans. Comput. 47 (3) (1998) 353–356.
7. C. Lee, E. Lu, J. Lee, “Bit-parallel systolic multipliers for GF(2^m) fields defined by all-one and equally spaced polynomials,” IEEE Trans. Comput. 50 (5) (2001) 385–393.
8. C.Y. Lee, “Low complexity bit-parallel systolic multiplier over GF(2^m) using irreducible trinomials,” IEE Proceedings on Computers and Digital Techniques, vol. 150, no. 1, pp. 39-42, February 2003.
9. Cilardo, “Fast parallel GF(2^m) polynomial multiplication for all degrees,” IEEE Trans. Comput. 62 (5) (2013) 929–943.
10. D. Jungnickel, A. J. Menezes, S. A. Vanstone, “On the number of self-dual bases of GF(2^m) over GF(q),” Proc. Amer. Math. Soc., vol. 109, no. 1, pp. 23–29, 1990.
11. E.D. Mastrovito, “VLSI Architectures for Computations in Galois Fields,” PhD thesis, Linkoping University, Linkoping, Sweden, 1991.
12. F. Gebali, A. Ibrahim, “Efficient Scalable Serial Multiplier Over GF(2^m) Based on Trinomial,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems.
13. … GF(2^m) Multiplication Using Polynomial Basis,”
14. H. Fan, Y. Dai, “Fast bit-parallel GF(2^m) multiplier for all trinomials,” IEEE Trans. Comput., vol. 54, no. 4, pp. 485–490, Apr. 2005.
15. H. Fan, M. Anwarul Hasan, “Fast Bit Parallel-Shifted Polynomial Basis Multipliers in GF(2^m),” IEEE Trans. on Circuits and Systems 53-I (12) (2006) 2606-2615.
16. H. Wu, “Bit-Parallel Polynomial Basis Multiplier for New Classes of Finite Fields,” IEEE Transactions on Computers, 57 (8), September 2008, 1023–1031.
17. H. Wu, M. A. Hasan, I. F. Blake, S. Gao, “Finite field multiplier using redundant representation,” IEEE Trans. Comput., vol. 51, no. 11, pp. 1306–1316, Nov. 2002.
18. Huong Ho, “Design and Implementation of a Polynomial Basis Multiplier Architecture Over GF(2^m),” Journal of Signal Processing Systems, 75 (3), June 2013, 203–208.
19. I. Hsu, et al., “A comparison of VLSI architecture of finite field multipliers using dual, normal, or standard bases,” IEEE Trans. Comput. 37 (6) (1998) 735–739.
20. J. Imaa, “Low latency GF(2^m) Polynomial basis Multiplier,” IEEE Trans. Circuits Syst. I. Regul. Pap. 58 (5) (2011) 935–946.
21. J. Xie, P. Meher, Z. Mao, “Low-latency high-throughput systolic multipliers over GF(2^m) for NIST recommended pentanomials,” IEEE Trans. Circuits Syst. I. Regul. Pap. 62 (3) (2015) 881–890.
22. K.W. Kim, J.C. Jeon, “Polynomial Basis Multiplier Using Cellular Systolic Architecture,” IETE Journal of Research, vol. 60, no. 2, pp. 194-199, June 2014.
23. L. Song, K. Parhi, “Low-energy digit-serial/parallel finite field multipliers,” J. Signal Process. Syst. 19 (2) (1998) 149–166.
24. S. Lin, D.J. Costello, “Error control coding: fundamentals and applications” (Prentice-Hall Inc., 2004).
25. S.E. Mathe, L. Boppana, “Design and implementation of a sequential polynomial basis multiplier over GF(2^m),” KSII Trans. Internet Inf. Syst. 11 (5) (2017) 2680–2700.
26. A.J. Menezes, S.A. Vanstone, “Elliptic curve cryptosystems and their implementation,” J. Cryptol. 6 (4) (1993) 209–224.
27. P. K. Meher, “On efficient implementation of accumulation in finite field over GF(2^m) and its applications,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 4, pp. 541–550, Apr. 2009.
28. P.H. Namin, R. Muscedere, M. Ahmadi, “Digit-level serial-in parallel-out multiplier using redundant representation for a class of finite fields,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 25 (5) (2017) 1632–1643.
29. S. Pillutla, L. Boppana, “An Area-Efficient Bit-Serial Sequential Polynomial Basis Finite Field GF(2^m) Multiplier,” AEU - International Journal of Electronics and Communications 114, 153017, November 2019.
30. S. Jain, L. Song, K. Parhi, “Efficient semi-systolic architectures for finite-field arithmetic,” IEEE Trans. VLSI Syst. 6 (1) (1998) 101–113.
31. S. Mathe, L. Boppana, “Design and implementation of a sequential polynomial basis multiplier over GF(2^m),” KSII Trans. Internet Inf. Syst. 11 (5) (2017) 2680–2700.
32. T.-Y. Lee, M.-J. Liu, C.-H. Huang, C.-C. Fan, C.-C. Tsai, H. Wu, “Design of a digit-serial multiplier over GF(2^m) using a karatsuba algorithm,” J. Chin. Inst. Eng. 42 (7) (2019) 602–612.
33. W.C. Tsai, S.J. Wang, “Two systolic architectures for multiplication in GF(2^m),” IEE Proceedings on Computers and Digital Techniques, vol. 147, no. 6, pp. 375-382, December 2000.
34. Xie J, jun HJ, Meher PK, “Low latency systolic Montgomery multiplier for finite field GF(2^m) based on pentanomials,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 21 (2012) 385.