Preprint
Article

Area-Time Efficient High-Radix Modular Inversion Algorithm and Hardware Implementation for ECC over Prime Fields

Altmetrics

Downloads

67

Views

178

Comments

0

A peer-reviewed article of this preprint also exists.

This version is not peer-reviewed

Submitted:

12 September 2024

Posted:

12 September 2024

You are already at the latest version

Alerts
Abstract
Modular inversion on large operands is a time-consuming calculation used in elliptic curve cryptography. Its hardware implementation requires extensive hardware resources such as lookup tables and registers. We investigate state-of-the-art modular inversion algorithms and evaluate the performance and cost of the algorithms and their hardware implementations. We then propose a high-radix modular inversion algorithm aimed at short execution time and low hardware cost. We present a detailed radix-8 hardware implementation based on 256-bit primes in Verilog HDL and compare its cost performance with other implementations. Our implementation on the Altera Cyclone V FPGA chip uses 1227 ALMs (Adaptive logic modules) and 1037 registers. The modular inversion calculation takes 3.67 microseconds. The AT (Area time) factor is 8.30, outperforming other implementations. We also present an implementation of elliptic curve cryptography using the proposed radix-8 modular inversion algorithm. The implementation results also show that our modular inversion algorithm is more efficient in area time than other algorithms.
Keywords: 
Subject: Computer Science and Mathematics  -   Computer Science

1. Introduction

Modular inversion is an important computation in elliptic curve cryptography (ECC). ECC provides a secure key agreement between two parties over an insecure network. It calculates points on an elliptic curve over a finite field (such as a field of prime numbers) based on point addition (PA) and point doubling (PD) computations. In affine coordinates, PA and PD must calculate the slope of a line. Such calculations involve costly modular inversions. In projective or Jacobian coordinates, PA and PD do not require such calculations, but a modular inversion is still required to transform the points to affine coordinates to obtain the same key for the two parties.
Given a prime number m, the inverse r of a number a with a < m is defined as r = a 1 m o d m . There are mainly two popular methods for calculating modular inversion:
(1)
Extended Euclidean Algorithm (EEA) without using divisions.
(2)
Using Fermat’s Little Theorem a m 1 = 1 m o d m  [1]: r = a m 2 m o d m = a 1 m o d m .
We will see that the method using Fermat’s Little Theorem takes longer time and requires more registers than EEA. Therefore, we will focus our design on using the EEA.
The EEA inherently needs divisions. It calculates the largest integer quotient and calculates the remainder based on the quotient. The divisions can be replaced by addition, subtraction, and shift operations. For simplicity, we will also refer to EEA which does not use divisions as EEA.
For calculating r = a 1 m o d m , EEA first initializes u , v , x , y with a , m , 1 , 0 , respectively. Then EEA repeats calculations containing only addition, subtraction, and shift operations on u , v , x , y until u = 1 or v = 1 . Finally, the modular inversion result is available by adjusting x or y, corresponding to u = 1 or v = 1 . A modular inversion algorithm is said to be fast if u or v reaches 1 quickly.
The widely used modular inversion algorithm is Algorithm 2.22, proposed by Hankerson, Menezes, and Vanstone [2]. It repeatedly shifts u or v to the right when u or v is even. Correspondingly, x is also shifted to the right with the shift of u; y is also shifted to the right with the shift of v. Note that when x or y is odd, m will be added before shift. This guarantees that the value to be right-shifted is even, since the prime m is odd. Next, if u v , u and x will be replaced by u v and x y , respectively. Otherwise, v and y will be replaced by v u and y x , respectively. Finally, the result is x m o d m if u = 1 and y m o d m otherwise. Hossain and Kong [3] revised Algorithm 2.22 by adding m to x or y if it is negative. This ensures that x and y are non-negative. Daly, Marnane, Kerins, and Popovici [4] revised Algorithm 2.22 by dividing u v or v u by two because the subtraction result is even (both u and v are odd before the subtraction). Correspondingly, x y or y x needs also to be divided by two: If x y or y x is odd, m is added before the division. Division by two is done by shifting one bit to the right. Mrabet, El-Mrabet, Bouallegue, Mesnager, and Machhout proposed a modular inversion algorithm [5] with u + v . Instead of u v or v u , as Algorithm 2.22 does, they perform u + v for new u or v. This operation slows down the speed at which u or v reaches 1, increasing execution time. Chen and Qin proposed a modular inversion algorithm [6] that uses only adders. Subtractions are performed by addition with inversion and addition by 1. Choi, Lee, Kong, and Kim proposed a modular inversion algorithm [7] that replaces the repeated shift of u or v and the corresponding shift of x or y in Algorithm 2.22 by a selection of u , x or 0 , 0 , or a selection of v , y or 0 , 0 , based on the even/odd of v or u. This simplifies the circuit by replacing adders with multiplexers, reducing the circuit delay. Also, they use v and y , instead of v and y, during the calculation. This merges u v and v u into u + v and merges x y and y x into x + y , reducing the circuit cost. Mixed radix-4 modular inversion algorithms are investigated in  [7,8,9,10]. If u or v is divisible by four, u or v is shifted to the right by two bits. Otherwise, if u or v is even (divisible by two), u or v is shifted to the right by one bit. Otherwise (both u and v are odd), u v or v u is shifted to the right by one bit and assigned to u or v. Correspondingly, x or y is adjusted by adding m , m, or 2 m and shifted to the right by two bits or one bit. [8] proposed a radix-4 modular inversion algorithm that uses a sequential condition checking for the calculations of u, v, x, and y. [9] implemented the SM2 ECC protocol. The iterations of the modular inversion are controlled by the bit counter ρ , resulting in unnecessary iterations. Using u and v to control the iterations will finish the calculation quickly. [10] gave a radix-4 version of Algorithm 2.22. Dong, Zhang, and Gao proposed a mixed radix-8 modular inversion algorithm [11] that uses extensive hardware resources.
The AT (Area time) factor is often used for comparisons between implementations. It is defined as the execution time in milliseconds multiplied by the required hardware resources consisting of registers and lookup tables or ALMs (Adaptive logic modules).
In this paper, we implement and evaluate all the algorithms mentioned above. We then propose a mixed radix-8 modular inversion algorithm aimed at short execution time and small hardware resources. We give its detailed hardware implementation in Verilog HDL based on 256-bit primes. It has lower hardware costs for ALMs and registers and has better performance than other algorithms. The implementation on the Altera Cyclone V FPGA chip uses 1227 ALMs and 1037 registers and takes 3.67 microseconds for the modular inversion computation. It achieves an AT factor of 8.30, lower than all other implementations. We also implement ECC using different modular inversion algorithms and compare their cost performance.
The rest of the paper is organized as follows. Section 2 introduces ECC and modular inversion algorithms. Instead of pseudocodes, all modular inversion algorithms are provided in Python code. Section 3 proposes a mixed radix-8 modular inversion algorithm, gives its hardware implementation in Verilog HDL, and compares its cost performance with other algorithms. Section 4 provides an ECC implementation using the proposed algorithm and compares its cost performance with the ECC implementations using other modular inversion algorithms. And Section 5 concludes the paper.

2. ECC and Modular Inversion Algorithms

This section briefly introduces elliptic curve cryptography and modular inversion algorithms based on the extended Euclidean algorithm.

2.1. Elliptic Curve Cryptography

ECC [12,13] relies on the fact that scalar point multiplication Q = d P can be computed, but it is almost impossible to compute d given only the original point P and the point of the product Q. An ECC over the finite field of an n-bit prime number m can use the equation:
y 2 = x 3 + a x + b m o d m
For example, Secp256k1 [14] elliptic curve used in Ethereum Blockchain uses a 256-bit m = 2 256 2 32 2 9 2 8 2 7 2 6 2 4 1 . Secp256k1 defines y 2 = x 3 + a x + b = x 3 + 7 and gives a point P = [ x , y ] on the elliptic curve as follows.
  a = 0x0000000000000000000000000000000000000000000000000000000000000000
  b = 0x0000000000000000000000000000000000000000000000000000000000000007
  m = 0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffefffffc2f
  x = 0x79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798
  y = 0x483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8
The elliptic curve Diffie–Hellman (ECDH) key exchange protocol can be used by two parties, Alice and Bob for example, to establish a shared secret key over an insecure network [14,15]. The ECDH protocol is shown in Table 1.
Because Q a b = d a Q b = d a d b P , Q b a = d b Q a = d b d a P , and d a d b = d b d a , we have Q b a = Q a b . Below is an ECDH key exchange example using Secp256k1. We can see that the two parties, Alice and Bob, have the same shared secret key (Qabx = Qbax).
  Alice keeps d a secret and exposes Q a = d a P :
  da   = 0x0e26233af432c34fa2523e72b64ebcf5abb4f7cb07d8f25909160a50f584f461
  Qax  = 0x672751a7b8ec03f4610c364d3776fc5200f401aa26074b10d33c2c202fc63330
  Qay  = 0xdbf4a0758d81414d2ee049a07c3d8288428235dec9dbfb7fb15d6f478cb4ccf6
Bob keeps d b secret and exposes Q b = d b P :
  db   = 0xbbb481621f91b8e9225649109919a7962cec4225c981b6b68b4f8d8a356cc6d9
  Qbx  = 0x09aff71612598a88d5299ae81e3161a2c9045f343315da8cfabb4dc55253041f
  Qby  = 0x736b118369583448c0b77eed11ac26af31450d406130d82a1bc56cc3a1835b1d
Alice obtains Q b and calculates Q a b = d a Q b :
  Qabx = 0x0f675b3195fd6a6f06c9a6960ff2a4f647f637f513c8bb7bedc8a89311f62df2
  Qaby = 0x79de922da4db277fe0c674277243c1dfd0653913d037fb07e955c3cdf21e69c7
Bob obtains Q a and calculates Q b a = d b Q a :
  Qbax = 0x0f675b3195fd6a6f06c9a6960ff2a4f647f637f513c8bb7bedc8a89311f62df2
  Qbay = 0x79de922da4db277fe0c674277243c1dfd0653913d037fb07e955c3cdf21e69c7
Now, Alice and Bob have the same secret key (Qabx = Qbax). They can use symmetric-key cryptography for subsequent communications. A third party, Eve for example, knows y 2 = x 3 + a x + b m o d m , P, Q a , and Q b , but cannot calculate the same secret key.

2.2. Point Addition and Point Doubling

Scalar point multiplication Q = d P calls point addition (PA) and point doubling (PD).

2.2.1. Point Addition

Given P = [ x p , y p ] and Q = [ x q , y q ] , the formulas for point addition R = [ x r , y r ] = P + Q on elliptic curve y 2 = x 3 + a x + b m o d m are shown as follows, where λ is the slope of the line through points P and Q.
λ = y q y p x q x p m o d m x r = ( λ 2 x p x q ) m o d m y r = ( λ ( x p x r ) y p ) m o d m
The point at infinity, denoted O , is included in the group of elliptic curves and is defined as P + ( P ) = O for Q = P . By this definition, P + O = P . In our implementation, O is represented as [ 1 , 1 ] . In the case of P = O , R = P + Q = O + Q = Q . In the case of Q = O , R = P + Q = P + O = P . We give the point addition R = P + Q algorithm over the finite field of F m in Algorithm 1. In the case of Q = P , R = P + Q = P + ( P ) = O (line 5 in the algorithm). In the case of Q = P , R = P + Q = P + P = 2 P , we perform the point doubling R = 2 P (line 6 in the algorithm).
Algorithm 1 PA (P, Q, m, a) (Point Addition in Affine Coordinates).
inputs: Points P = [ P x , P y ] and Q = [ Q x , Q y ] ; m and a in y 2 = x 3 + a x + b   mod   m
output: R = P + Q = [ R x , R y ] = [ x r , y r ]
begin
1 x p = P x , y p = P y , x q = Q x , y q = Q y , O = [ 1 , 1 ]
2 if  P = O  return Q     /* O + Q = Q */
3 if  Q = O  return P      /* P + O = P */
4 if  x p = x q
5 if  ( y p + y q )   mod   m = 0  return  O     /* P + ( P ) = O */
6 else return PD (P, m, a)      /* P + P = 2 P */
7 λ = ( ( y q y p ) / ( x q x p ) )   mod   m
8 x r = ( λ 2 x p x q )   mod   m
9 y r = ( λ ( x p x r ) y p )   mod   m
10 return  [ x r , y r ]       /* R = P + Q */
end
An example of point addition R = P + Q on the Secp256k1 curve is shown below where [ P x , P y ] = P , [ Q x , Q y ] = Q , and [ R x , R y ] = R in affine coordinates.
  Px = 0xfc7dafb820a20da1a73c36465f2fe37bfd98ce4ef3a10a5df110abda03b20a3d
  Py = 0xa442a2d1b8bde4a09e45725add5daae89e726b56f0e8fe6609dacaf5279b2564
  Qx = 0xe106c069450b2663febb83e29b67fa93c4c48a45d5fbe7ce4ddb8ceb601fcc1d
  Qy = 0xc9da9bd440909c8862c06a44d432d2dd45284636b7049b9bf4695f9e4018d2f2
  Rx = 0xfd52a0334e16f8cf45a6b0820887a9e8b1b180516a76c8adfef95df98aeef376
  Ry = 0xb0fe3f04cc4c64fd66a133b8c97b4905771238f8ba89631efb85a8059e969a49

2.2.2. Point Doubling

Given P = [ x p , y p ] , the formulas for point doubling R = [ x r , y r ] = 2 P on elliptic curve y 2 = x 3 + a x + b m o d m are shown as follows, where λ is the slope of the tangent line of the elliptic curve at point P.
λ = 3 x p 2 + a 2 y p m o d m x r = ( λ 2 2 x p ) m o d m y r = ( λ ( x p x r ) y p ) m o d m
We give the point doubling R = 2 P algorithm over the finite field of F m in Algorithm 2. In the case of P y = 0 (vertical tangent line), R = 2 P = O (line 2 in the algorithm).
Algorithm 2 PD (P, m, a) (Point Doubling in Affine Coordinates).
inputs: Point P = [ P x , P y ] ; m and a in y 2 = x 3 + a x + b   mod   m
output: R = 2 P = [ R x , R y ] = [ x r , y r ]
begin
1 x p = P x , y p = P y , O = [ 1 , 1 ]
2 if  y p = 0  return  O      /* vertical tangent */
3 λ = ( ( 3 x p 2 + a ) / ( 2 y p ) ) m o d m
4 x r = ( λ 2 2 x p )   mod   m
5 y r = ( λ ( x p x r ) y p )   mod   m
6 return  [ x r , y r ]      /* R = 2 P */
end
An example of point doubling R = 2 P on the Secp256k1 curve is shown below where [ P x , P y ] = P and [ R x , R y ] = R in affine coordinates.
  Px = 0x6034b56424fb31ea6ec5483b52ae5d07d6f3ef80264d769ae2714abb83fb279a
  Py = 0xfe4cde1ff7546a87f906f50ab1002fda7811828ea6fc467a44d1c6c11aa65a37
  Rx = 0x5491ee8b73a4ed9713ed32e467de5100b80861babf8ffd09fd595ab457d042c9
  Ry = 0xf91e6a4e132a1bdf4f5c846559431ec7373de8872b719f188b5902932f0a2b30
The computation of λ in PA and PD requires modular division, which can be realized by a modular inversion algorithm based on the extended Euclidean algorithm.

2.3. Modular Inversion Algorithms

Given a prime number m, the inverse r of a number a with a < m is defined as
r = a 1 m o d m
That is, r a = 1   mod   m . The Python code below implements the modular inversion calculation using Fermat’s Little Theorem. When executed, it outputs 4 4 4. The first output value is calculated by the code, and the rest are for checking. This calculation consists of costly modular multiply and modular squaring, very similar to RSA exponentiation [16].
Preprints 118009 i001Preprints 118009 i002
The extended Euclidean algorithm can be used for the modular inversion calculation. Below is the fundamental extended Euclidean algorithm given in Python code (modinv_algo_1.py), where q is the integer quotient of u divided by v.
Preprints 118009 i003
Considering b = 1 . u and x are initialized with a and 1, respectively. At each iteration, u and x are modified with similar calculations. Therefore, when u reaches 1 from a, x reaches the reciprocal of a from 1. If m is a prime number, the greatest common divisor of a and m is guaranteed to be 1, and we can always get the inverse result of a. With the initialization of x with b, the algorithm performs the modular division r = b a 1 m o d m .
An example of running modinv_algo_1.py using modinv(1, 3, 11) is shown in Table 2 ( b = 1 , a = 3 , and m = 11 ). The calculation finishes when v = 0 . Because u = 1 , the result a 1 m o d m = x m o d m = 4 m o d 11 = 4 . We can check the correctness as follows: r a m o d m = 4 × 3 m o d 11 = 12 m o d 11 = 1 m o d 11 .
The algorithm requires division, which is expensive. We can remove division by setting the quotient to 0 or 1, as shown below (modinv_algo_2.py).
Preprints 118009 i004
The algorithm yields a quotient of 0 or 1 based on the comparison of u and v. If the quotient is a 1, a subtraction is performed. Otherwise, no calculation is performed, resulting in a slow calculation speed. The calculation of modinv (1, 3, 11) will require 9 iterations. We can modify the algorithm as follows (modinv_algo_3.py). This reduces the number of iterations by about half.
Preprints 118009 i005
We can check u first before the subtractions. If it is even, we can shift it to the right by one bit (the least significant bit 0 is shifted out). Correspondingly, x must also be shifted. To ensure that the value to be right-shifted is even, m will be added to x before the shift if x is odd. Note that m is odd because it is a prime number. This shift of u and x can be performed repeatedly until u becomes an odd number. Similar actions can be applied to v and y. Then we can have an algorithm as follows (modinv_algo_4.py). In fact, this is Algorithm 2.22 provided in [2] and implemented in Verilog HDL in [17].
Preprints 118009 i006
When the two inner while loops finish, u and v are both odd numbers. Therefore u v or v u is even. Then we can shift it to the right by one bit. Correspondingly, x y or y x must also be shifted. If x y or y x is odd, m must be added before the shift so that the bit being shifted out is 0. The algorithm is shown below (modinv_algo_5.py).
Preprints 118009 i007Preprints 118009 i008
The two inner while loops can be replaced by assigning u, x, v, y, or 0 to the temporary variables t u , t x , t v , and t y . The following assignments make t u t v even, so we can shift it to the right by one bit: (1) If both u and v are odd, those temporary variables are assigned with u, x, v, y, respectively. (2) If u is even and v is odd, those temporary variables are assigned with u, x, 0, 0, respectively. (3) If u is odd and v is even, those temporary variables are assigned with 0, 0, v, y, respectively. Such replacements reduce the latency from the carry propagate adder to the multiplexer and speed up the calculation. The algorithm is shown below (modinv_algo_6.py).
Preprints 118009 i009
The algorithm above requires calculations of t u t v , t v t u , t x t y , and t y t x . We can unify these calculations with negative assignments v and y to v and y, respectively. That is, u = u v = u + ( v ) becomes u = u + v , and x = x y = x + ( y ) becomes x = x + y with the negative assignments to v and y. Similarly, v = ( u v ) becomes v = u + v , and y = ( x y ) becomes y = x + y with the negative assignments to v and y. Therefore, u + v and x + y are sufficient for the calculation. The algorithm is given below (modinv_algo_7.py). Because of the negative assignments to v and y, v is initialized with m and y is initialized with 0 = 0 . Note that x is never greater than or equal to m. Therefore, no adjustment of x = x m or x = x m o d m is required.
Preprints 118009 i010Preprints 118009 i011
A modular inversion algorithm is said to be good when u reaches 1 quickly (high performance) and the algorithm uses a small number of adders and subtractors (low cost).

3. Proposed Radix-8 Modular Inversion Algorithm and its Performance

The proposed mixed radix-8 modular inversion algorithm is shown below (modinv_radix_8.py). To calculate r = a 1 m o d m , we initialize u = a and v = m with the negative assignment to v and y. The temporary variable t u is assigned with u or 0 and the temporary variable t v is assigned with v or 0 so that t u v = t u + t v is even. If the three least significant bits of t u v are 000, it is shifted to the right by three bits (radix-8). Otherwise, if the two least significant bits of t u v are 00, it is shifted to the right by two bits (radix-4). Otherwise, it is shifted to the right by one bit (radix-2), because t u v is even.
Preprints 118009 i012Preprints 118009 i013
Correspondingly, t x and t y are arranged and t x y = t x + t y is also shifted to the right by three bits, two bits, or one bit. The bits being shifted out must be 0. Therefore, we need to adjust t x y using the prime number m before shifting. Table 3 lists such adjustments based on the three least significant bits of t x y and the three least significant bits of m for the radix-8 operations, where x represents a don’t care. The three least significant bits of the adjusted value are 000, as shown in the comment column of the table.
Similarly, Table 4 lists the adjustments based on the two least significant bits of t x y and the two least significant bits of m for the radix-4 operations. The two least significant bits of the adjusted value are 00, as shown in the comment column of the table.
An example of c = b × a 1 m o d m is shown below. The modinv_radix_8 algorithm takes 206 iterations to reach u = 1 and v = 1 . In contrast, the modinv_radix_4 and modinv_radix_2 algorithms require 243 and 356 iterations, respectively.
  b = 0x9cfa1c993911914be0f15bd74a878abe0079c6254b961b82e1abda76387d1d85
  a = 0xd5076ae274e874c2eb0f7778717c39460236549ddd9fc651e68a0c0e787b4ce8
  m = 0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffefffffc2f
  c = 0xe8e5ac2e1d3358894ce1b3342737b38c39b89059dd55d3c4741626de8270228e
To reduce the number of adders, we use multiplexers to select an appropriate value and assign it to the temporary variable t z . And then we perform t x y = t x + t y + t z . Based on the three least significant bits of t u v , we assign t x y 1 , t x y 2 , or t x y 3 (shift right) to x y . Figure 1 shows the block diagram of the proposed mixed radix-8 modular inversion circuit. Because we perform addition for t x y = t x + t y + t z , 2 m and 3 m are replaced by + 6 m and + 5 m , respectively. Also, for the addition, we prepare m that can be obtained by inverting all the bits of m and setting the right-most bit to 1 because m is odd. The registers u, v, x, and y are shown with red rectangles. The multiplexers are drawn with green rectangles. The blue rectangles are adders. The Verilog HDL implementation uses continuous assignment to compute u v and x y and writes them to the corresponding registers on the rising edge of the clock signal. Note that we have to use adders for generating 3 m , 5 m , and 6 m , which are not shown in the figure.
Below we give the hardware implementation code in Verilog HDL for the proposed mixed radix-8 modular inversion algorithm (modinv_r8.v). The signals start and ready indicate the start of the modular inversion calculation and the availability of the calculation result, respectively. Because we use Secp256k1 elliptic curve, the input and output signals b, a, m, and c are 256 bits. During the calculations, we use 260 bits for the internal signals.
Preprints 118009 i014Preprints 118009 i015
Below is the testbench Verilog HDL code used to simulate modinv_r8.v.
Preprints 118009 i016Preprints 118009 i017
Figure 2 shows the functional simulation waveform, generated with ModelSim. The result c is available at 416ns. That is, the calculation takes 208 clock cycles.
We have implemented the modular inversion algorithms on the Altera Cyclone V 5CGXFC9E7F35C8 FPGA chip. Table 5 lists the cost performance of the modular inversion algorithms. The column of Cycles shows the required number of clock cycles when executing the modular inversion algorithm. The column of Freq.(MHz) shows the clock frequency in MHz at which the circuit can work. The column of Latency( μ s) shows the execution time in microseconds calculated by dividing the clock cycles by the clock frequency. The column of ALMs shows the required number of adaptive logic modules. The column of Registers shows the required number of flip-flops. The flip-flops are mainly used to store u, v, x, and y. Their contents are updated on every clock cycle. The last column shows the AT factor, which is the product of the Latency in milliseconds and the sum of ALMs and Registers.
The row of [1] in the table shows the performance and cost of modular inversion using Fermat’s Little Theorem r = a m 2 m o d m = a 1 m o d m . It consists of costly modular multiply and modular squaring, very similar to RSA exponentiation [16]. Its AT factor is much higher than others. The remaining rows show the performance and cost of the EEA-based modular inversion algorithms. The number of registers used by [2,3,8], and [10] is larger than others. This is because extra registers are used to adjust the value of x or y so that the modular inversion result is within the range of 0 and m. Our algorithm implementation achieves an execution time of 3.67 μ s and an AT factor of 8.30, outperforming all other implementations. Figure 3 shows an intuitive view of latency and AT histograms.

4. ECC Implementation with Proposed Modular Inversion Algorithm

ECC relies on scalar point multiplication. Suppose P = [ x p , y p ] is a point on the curve, the scalar point multiplication Q = d P gets the Q = [ x q , y q ] that is also on the curve, where d = d n 1 d 1 d 0 is an n-bit scalar. Scalar point multiplication can be conducted with point addition (adding two points) and point doubling, as shown in Algorithm 3.
Algorithm 3 ScaMul (d, P, m, a) (Scalar Point Multiplication in Affine Coordinates).
inputs: d = d n 1 d 1 d 0 and point P = [ P x , P y ] ; m and a in y 2 = x 3 + a x + b m o d m
output: Q = d P
begin
1 Q = O , R = P , k = d /* Q = O and R = P */
2 while  k 0  to
3 if  k & 1 = 1
4 Q = PA (Q, R, m, a) /* Q = Q + R (Algorithm 1) */
5 R = PD (R, m, a) /* R = 2 R (Algorithm 2) */
6 k = k 1
7 endwhile
8 return Q /* Q = d P */
end
The algorithm calls point addition PA (P, Q, m, a) and point doubling PD (P, m, a). Table 6 gives an example to show the calculation steps of the scalar point multiplication. For a 5-bit d = 10101 2 = 21 , we calculate Q = d P in 5 steps to obtain Q = 21 P . We can see that the algorithm is similar to RSA exponentiation [16].
The ECDH algorithm is shown below in Python code that invokes scalar point multiplication four times. The code is hardware oriented. For code integrity, we listed our radix-8 code again but here + 6 m and + 5 m are used instead of 2 m and 3 m , respectively.
Preprints 118009 i018Preprints 118009 i019Preprints 118009 i020
Based on the above Python code, we implemented ECC using our radix-8 modular inversion algorithm for calculating λ in PA and PD. Figure 4 shows the functional simulation waveform of scalar point multiplication Q = d P with P = [ x , y ] and Q = [ q x , q y ] . The result is available at 635362ns. That is, the calculation takes 317681 clock cycles.
The ECC cost performance comparison is given in Table 7 when implementing on the Altera Cyclone V 5CGXFC9E7F35C8 FPGA chip. All ECC implementations use the same circuit except for the modular inversion part. Figure 5 shows the latency and AT histogram. The ECC latency using our proposed radix-8 modular inversion algorithm is 0.02007 second and its AT factor is 401237.93. From the table and histogram, we can see that our ECC implementation achieves a lower latency and a lower AT factor than all others.

5. Conclusions

In this paper, we proposed a mixed radix-8 modular inversion algorithm and hardware implementation based on 256-bit primes in Verilog HDL and compared its cost performance with other implementations on the Altera Cyclone V FPGA chip. The algorithm and its hardware implementation are area-time efficient with an AT factor of 8.30, which outperforms other algorithms and implementations. We also presented the cost performance of an elliptic curve cryptography implementation using the proposed modular inversion algorithm. Implementation results also show that our algorithm reduces execution time and requires fewer hardware resources than all other investigated algorithms.
Future research could include shortening the critical path and using carry-select adders and carry-save adders to speed up additions of large operands. Also, using a fixed prime m, Secp256k1 for example, we can simplify the circuit by considering only the case where the lower three bits are equal to 111 and using precomputations of 3 m , 5 m , and 6 m to speed up the radix-8 modular inversion calculations.

References

  1. Burton, D. The History of Mathematics / An Introduction (7th ed.); McGraw-Hill, 2011. [CrossRef]
  2. Hankerson, D.; Menezes, A.; Vanstone, S. Guide to Elliptic Curve Cryptography; Springer-Verlag: New York Inc, 2004. [Google Scholar] [CrossRef]
  3. Hossain, M.S.; Kong, Y. High-Performance FPGA Implementation of Modular Inversion over F_256 for Elliptic Curve Cryptography. 2015 IEEE International Conference on Data Science and Data Intensive Systems, 2015, pp. 169–174. [CrossRef]
  4. Daly, A.; Marnane, W.; Kerins, T.; Popovici, E. Division in GF(p) for Application in Elliptic Curve Cryptosystems on Field Programmable Logic. In New Algorithms, Architectures and Applications for Reconfigurable Computing; Springer US: Boston, MA, 2005; pp. 219–229. [CrossRef]
  5. Mrabet, A.; El-Mrabet, N.; Bouallegue, B.; Mesnager, S.; Machhout, M. An efficient and scalable modular inversion/division for public key cryptosystems. 2017 International Conference on Engineering & MIS (ICEMIS), 2017, pp. 1–6. [CrossRef]
  6. Chen, C.; Qin, Z. Fast Algorithm and Hardware Architecture for Modular Inversion in GF(p). 2009 Second International Conference on Intelligent Networks and Intelligent Systems, 2009, pp. 43–45. [CrossRef]
  7. Choi, P.; Lee, M.K.; Kong, J.T.; Kim, D.K. Efficient Design and Performance Analysis of a Hardware Right-shift Binary Modular Inversion Algorithm in GF(p). Journal of Semiconductor Technology and Science 2017, 17, 425–437. [Google Scholar] [CrossRef]
  8. Wang, D.; Lin, Y.; Hu, J.; Zhang, C.; Zhong, Q. FPGA Implementation for Elliptic Curve Cryptography Algorithm and Circuit with High Efficiency and Low Delay for IoT Applications. Micromachines 2023, 14, 1–15, https://www.mdpi.com/2072-666X/14/5/1037, doi:10.3390/mi14051037. [Google Scholar] [CrossRef] [PubMed]
  9. Yang, D.; Dai, Z.; Li, W.; Chen, T. An Efficient ASIC Implementation of Public Key Cryptography Algorithm SM2 Based on Module Arithmetic Logic Unit. 2019 IEEE 13th International Conference on ASIC (ASICON), 2019, pp. 1–4. [CrossRef]
  10. Yan, X.; Li, S. Modified modular inversion algorithm for VLSI implementation. 2007 7th International Conference on ASIC, 2007, pp. 90–93. [CrossRef]
  11. Dong, X.; Zhang, L.; Gao, X. An Efficient FPGA Implementation of ECC Modular Inversion over F256. Proceedings of the 2nd International Conference on Cryptography, Security and Privacy, 2018, pp. 29–33. [CrossRef]
  12. Koblitz, N. Elliptic curve cryptosystems. Mathematics of Computation 1987, 48, 203–209, https://www.ams.org/journals/mcom/1987-48-177/S0025-5718-1987-0866109-5/S0025-5718-1987-0866109-5.pdf. [Google Scholar] [CrossRef]
  13. Miller, V.S. Use of Elliptic Curves in Cryptography. Advances in Cryptology — CRYPTO ’85 Proceedings; Springer Berlin Heidelberg: Berlin, Heidelberg, 1986; pp. 417–426, https://link.springer.com/content/pdf/10.1007/3-540-39799-X_31.pdf?pdf=inline%20link. [Google Scholar]
  14. Certicom_Corp. Standards for Efficient Cryptography. SEC 2: Recommended Elliptic Curve Domain Parameters; http://www.secg.org/sec2-v2.pdf, 2010.
  15. Barker, E.; Chen, L.; Roginsky, A.; Vassilev, A.; Davis, R. SP 800-56A Rev. 3, Recommendation for Pair-Wise Key-Establishment Schemes Using Discrete Logarithm Cryptography; National Institute of Standards and Technology, 2018. https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-56Ar3.pdf.
  16. Li, Y.; Chu, W. Shift-Sub Modular Multiplication Algorithm and Hardware Implementation for RSA Cryptography. 17th International Conference on Information Assurance and Security, Lecture Notes in Networks and Systems; Springer: Cham, 2021; pp. 541–552. [Google Scholar] [CrossRef]
  17. Li, Y. Hardware Implementations of Elliptic Curve Cryptography Using Shift-Sub Based Modular Multiplication Algorithms. Cryptography 2023, 7, 1–29. [Google Scholar] [CrossRef]
Figure 1. Block diagram of the proposed mixed radix-8 modular inversion circuit.
Figure 1. Block diagram of the proposed mixed radix-8 modular inversion circuit.
Preprints 118009 g001
Figure 2. Waveform of modular inversion that calculates c = b a 1 m o d m .
Figure 2. Waveform of modular inversion that calculates c = b a 1 m o d m .
Preprints 118009 g002
Figure 3. Latency and AT comparison of modular inversion algorithms.
Figure 3. Latency and AT comparison of modular inversion algorithms.
Preprints 118009 g003
Figure 4. Waveform of scalar point multiplication Q = d P with P = [ x , y ] and Q = [ q x , q y ] .
Figure 4. Waveform of scalar point multiplication Q = d P with P = [ x , y ] and Q = [ q x , q y ] .
Preprints 118009 g004
Figure 5. ECC Latency and AT comparison of modular inversion algorithms.
Figure 5. ECC Latency and AT comparison of modular inversion algorithms.
Preprints 118009 g005
Table 1. Elliptic curve Diffie–Hellman key exchange.
Table 1. Elliptic curve Diffie–Hellman key exchange.
Expose an elliptic curve y 2 = x 3 + a x + b m o d m and a point P on the elliptic curve to the world
Alice Bob
Generate a secret d a Generate a secret d b
Calculate Q a = d a P Calculate Q b = d b P
Expose Q a Expose Q b
Get Q b from Bob Get Q a from Alice
Calculate Q a b = d a Q b Calculate Q b a = d b Q a
Use x of Q a b as the key Use x of Q b a as the key
Table 2. Execution example of modinv_algo_1.py: print(modinv(1, 3, 11)). It calculates r = 3 1 m o d 11 . The result is x m o d 11 = 4 .
Table 2. Execution example of modinv_algo_1.py: print(modinv(1, 3, 11)). It calculates r = 3 1 m o d 11 . The result is x m o d 11 = 4 .
i u v x y q
0 3 = a    11 = m 1 = b 0 q = u / v
0 u = v v = u q * v x = y y = x q * y
1    0 = 3 / 11    
1    11 = v    3 = 3 0 * 11     0 = y 1 = 1 0 * 0
2 3 = 11 / 3
2 3 = v 2 = 11 3 * 3 1 = y 3 = 0 3 * 1
3 1 = 3 / 2
3 2 = v 1 = 3 1 * 2     3 = y    4 = 1 1 * ( 3 )
4 2 = 2 / 1
4 1 = v 0 = 2 2 * 1 4 = y     11 = ( 3 ) 2 * 4    
   End    u = 1 v = 0 x = 4
Table 3. XY adjustment for shift right by three-bit in the proposed modular inversion algorithm.
Table 3. XY adjustment for shift right by three-bit in the proposed modular inversion algorithm.
txy m 2 m 3 m 4 m ( t xy Comment
000 xx1 ( t x y + 0 m ) / / 8 0 + 0 = 0
100 xx1 100 ( t x y + 4 m ) / / 8 4 + 4 = 8
010 x01 010 ( t x y 2 m ) / / 8 2 2 = 0
110 x11 110 ( t x y 2 m ) / / 8 6 6 = 0
010 x11 110 ( t x y + 2 m ) / / 8 2 + 6 = 8
110 x01 010 ( t x y + 2 m ) / / 8 6 + 2 = 8
001 001 ( t x y 1 m ) / / 8 1 1 = 0
011 011 ( t x y 1 m ) / / 8 3 3 = 0
101 101 ( t x y 1 m ) / / 8 5 5 = 0
111 111 ( t x y 1 m ) / / 8 7 7 = 0
001 101 010 111 ( t x y + 3 m ) / / 8 1 + 7 = 8
011 111 110 101 ( t x y + 3 m ) / / 8 3 + 5 = 8
101 001 010 011 ( t x y + 3 m ) / / 8 5 + 3 = 8
111 011 110 001 ( t x y + 3 m ) / / 8 7 + 1 = 8
001 111 ( t x y + 1 m ) / / 8 1 + 7 = 8
011 101 ( t x y + 1 m ) / / 8 3 + 5 = 8
101 011 ( t x y + 1 m ) / / 8 5 + 3 = 8
111 001 ( t x y + 1 m ) / / 8 7 + 1 = 8
001 011 110 001 ( t x y 3 m ) / / 8 1 1 = 0
011 001 010 011 ( t x y 3 m ) / / 8 3 3 = 0
101 111 110 101 ( t x y 3 m ) / / 8 5 5 = 0
111 101 010 111 ( t x y 3 m ) / / 8 7 7 = 0
Table 4. XY adjustment for shift right by two-bit in the proposed modular inversion algorithm.
Table 4. XY adjustment for shift right by two-bit in the proposed modular inversion algorithm.
txy m 2 m ( t xy Comment
00 x1 ( t x y + 0 m ) / / 4 0 + 0 = 0
10 x1 10 ( t x y + 2 m ) / / 4 2 + 2 = 4
01 01 ( t x y 1 m ) / / 4 1 1 = 0
11 11 ( t x y 1 m ) / / 4 3 3 = 0
01 11 ( t x y + 1 m ) / / 4 1 + 3 = 4
11 01 ( t x y + 1 m ) / / 4 3 + 1 = 4
Table 5. Comparison of modular inversion algorithms (on Altera Cyclone V FPGA chip).
Table 5. Comparison of modular inversion algorithms (on Altera Cyclone V FPGA chip).
Algorithm Cycles Freq.(MHz) Latency( μ s)   ALMs   Registers AT
[1] 66264 57.54 1151.63 2004 2775 5503.66
[2] 534 54.66 9.77 2619 1302 38.31
[3] 535 54.52 9.81 3735 1303 49.42
[4] 358 39.73 9.01 2474 1038 31.64
[5] 1205 64.55 18.67 1596 1043 49.26
[6] 723 72.21 10.01 1968 1042 30.13
[7] 358 63.60 5.63 959 1037 11.24
[8] 423 59.56 7.10 3475 1303 33.92
[9] 356 60.43 5.89 3950 1057 29.50
[10] 423 54.99 7.69 3644 1303 38.05
[11] 334 56.93 5.87 5276 1057 37.15
Ours 208 56.71 3.67 1227 1037 8.30
Table 6. Execution example of Q = d P with d = 10101 2 = 21 .
Table 6. Execution example of Q = d P with d = 10101 2 = 21 .
Weight Point addition Point doubling
Initial Q = O R = P
d 0 = 1 1 Q = Q + R = O + P = P R = 2 R = 2 P
d 1 = 0 2 R = 2 R = 4 P
d 2 = 1 4 Q = Q + R = P + 4 P = 5 P R = 2 R = 8 P
d 3 = 0 8 R = 2 R = 16 P
d 4 = 1 16 Q = Q + R = 5 P + 16 P = 21 P R = 2 R = 32 P
Table 7. ECC comparison using modular inversion algorithms (on Altera Cyclone V FPGA chip).
Table 7. ECC comparison using modular inversion algorithms (on Altera Cyclone V FPGA chip).
Algorithm Cycles Freq.(MHz) Latency(ms) ALMs Registers AT
[2] 402145 15.94 25.23 15043 8355 590300.42
[3] 402400 15.58 25.83 17585 8355 669977.92
[4] 357262 16.06 22.25 14975 7821 507107.38
[5] 570142 16.07 35.48 13292 7834 749522.08
[6] 455425 16.15 28.20 14211 7831 621577.58
[7] 356878 16.01 22.29 12114 7820 444347.67
[8] 372127 15.98 23.29 17292 8353 597196.30
[9] 352761 15.72 22.44 17841 7860 576737.31
[10] 372127 16.08 23.14 18548 8355 622595.32
[11] 346194 15.88 21.80 20716 7859 622952.99
Ours 317681 15.82 20.08 12157 7824 401237.93
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Prerpints.org logo

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

Subscribe

© 2024 MDPI (Basel, Switzerland) unless otherwise stated