Preprint
Article

Hardware Implementations of Elliptic Curve Cryptography Using Shift-Sub Based Modular Multiplication Algorithms

Altmetrics

Downloads

310

Views

144

Comments

0

A peer-reviewed article of this preprint also exists.

This version is not peer-reviewed

Submitted:

30 August 2023

Posted:

31 August 2023

You are already at the latest version

Alerts
Abstract
Elliptic curve cryptography (ECC) over prime fields relies on scalar point multiplication realized by point addition and point doubling. Point addition and point doubling operations consist of many modular multiplications of large operands (256 bits for example), especially in projective and Jacobian coordinates which eliminate the modular inversion required in affine coordinates for every point addition or point doubling operation. Accelerating modular multiplication is therefore important for high-performance ECC. This paper presents the hardware implementations of modular multiplication algorithms, including 1) Interleaved modular multiplication (IMM), 2) Montgomery modular multiplication (MMM), 3) Shift-sub modular multiplication (SSMM), 4) SSMM with advance preparation (SSMMPRE), and 5) SSMM with CSAs and sign detection (SSMMCSA) algorithms, and evaluates their execution time (the number of clock cycles and clock frequency) and required hardware resources (ALMs and registers). Experimental results show that SSMM is 1.76 times faster than IMM, and SSMMCSA is 3.21 times faster than IMM. We also present the ECC hardware implementations based on the Secp256k1 protocol in affine, projective, and Jacobian coordinates using the IMM, SSMM, SSMMPRE, and SSMMCSA algorithms, and investigate their cost and performance. Our ECC implementations can be applied to the design of hardware security module systems.
Keywords: 
Subject: Computer Science and Mathematics  -   Security Systems

1. Introduction

The use of elliptic curves in cryptography is proposed by Neal Koblitz and Victor S. Miller independently in 1985 [1,2]. Elliptic curve cryptography (ECC) over finite field of a prime number m relies on the fact that scalar point multiplication Q = d P can be computed, but it is almost impossible to compute the multiplicand d given only the original point P and the point of the product Q. Scalar point multiplication can be done with point addition (adding two points) and point doubling [3]. In conventional affine coordinates, point addition and point doubling require computing the slope of a line involving division. In both projective and Jacobian coordinates, divisor multiplication is calculated such that division is eliminated during scalar point multiplication. One final division is required to convert a point from projective or Jacobian coordinates to affine coordinates to obtain the shared secret key based on the elliptic curve Diffie-Hellman (ECDH) key exchange. ECDH is a variant of the Diffie-Hellman key agreement protocol using ECC between two parties to establish a shared secret key over an insecure network [4,5]. A modular inversion algorithm calculating a 1 mod   m is given in [3]. Based on it, this paper provides a Verilog HDL implementation to calculate b a 1 mod   m . It can be used in affine coordinates to calculate the line slope, or in projective and Jacobian coordinates to convert points to affine coordinates in the final step for obtaining the shared secret key.
Point addition and point doubling operations consist of many modular multiplications, especially in projective and Jacobian coordinates. Accelerating modular multiplication is therefore important for high-performance ECC. Interleaved modular multiplication (IMM) algorithm [6] eliminates the need for division and has lower hardware cost. Montgomery modular multiplication (MMM) algorithm [7] performs multiplication in Montgomery domain and is very efficient for modular exponentiation used by RSA cryptography. Shift-sub modular multiplication (SSMM) algorithm [8,9] performs modular multiplication with shift and subtraction. An algorithm using CSA (Carry save adder) [10] is proposed but it requires either a modular computation after the iterations or a pre-computed look-up table for fixed multiplier and moduli (Y and M in the algorithm). This paper proposes two enhanced versions of SSMM: SSMM with advance preparation (SSMMPRE) and SSMM with CSAs and sign detection (SSMMCSA) algorithms. This paper also presents the ECC implementations based on Secp256k1 [4] protocol which is used in Bitcoin where a 256-bit m = 2 256 2 32 2 9 2 8 2 7 2 6 2 4 1 is used as the modulus.
The new contributions of this paper are summarized below. 1) Based on SSMM, we propose SSMMPRE (SSMM with advance preparation) and SSMMCSA (SSMM with CSAs and sign detection) algorithms that have lower latency than SSMM. 2) The hardware implementations of IMM, MMM, SSMM, SSMMPRE, and SSMMCSA algorithms are presented and their cost (required adaptive logic modules (ALMs) and registers) and performance (the number of clock cycles and clock frequency) are evaluated. 3) The hardware implementations of ECC in affine, projective, and Jacobian coordinates using the IMM, SSMM, SSMMPRE, and SSMMCSA algorithms are presented and their cost and performance are evaluated. And 4) some important Verilog HDL source codes and their simulation waveform are included. Compared to RSA cryptography, ECC provides stronger encryption with shorter key lengths. The ECC implementations described in this paper can be applied to the design of HSM (Hardware security module) systems.
The rest of the paper is organized as follows. Section 2 introduces the background of ECC, including the point addition, point doubling, scalar point multiplication, ECDH key agreement protocol, and affine, projective, and Jacobian coordinates. Section 3 describes the modular multiplication algorithms, including the IMM, MMM, SSMM, SSMMPRE, and SSMMCSA algorithms. Section 4 presents hardware implementations of modular multiplications and ECC and evaluates their cost and performance. And Section 5 concludes the paper. Verilog HDL codes for SSMM and modular inversion are listed in appendixes.

2. Elliptic Curve Cryptography Algorithms

This section describes ECC algorithms and ECDH that underlie this work. A top view of the relationship between these algorithms is shown in Figure 1.
The interesting computations in ECC are point addition and point doubling. We first introduce these two computations in an elliptic curve over the real numbers. In the real number field, an elliptic curve can be defined in Weierstrass form as
y 2 = x 3 + a x + b
Note that this curve is symmetrical about the x-axis. If point P = [ x p , y p ] is on an elliptic curve, then P = [ x p , y p ] is also on the same elliptic curve.

2.1. ECC Point Addition and Doubling in Affine Coordinates

Affine coordinates use two coordinates [ x , y ] to represent an elliptic curve point, as shown by, for example, P = [ x p , y p ] and P = [ x p , y p ] . We will see that the ECC point addition and point doubling in affine coordinates require expensive divisions.

2.1.1. ECC Point Addition in Affine Coordinates

Figure 2 shows the point addition R = P + Q on an elliptic curve y 2 = x 3 + a x + b . Given two distinct points P = [ x p , y p ] and Q = [ x q , y q ] on the curve, if the line L 1 through P and Q intersects the curve in S = [ x s , y s ] , then R = [ x r , y r ] = P + Q is defined as x r = x s and y r = y s . Because x r = x s , the line L 2 through S and R is a vertical line.
Below we show how to get formulas to calculate x r and y r based on x p , y p , x q , and y q for y 2 = x 3 + a x + b . The formula of the line L 1 through P and Q is
y = λ ( x x p ) + y p
where λ is the slope of the line L 1 . Squaring both the left side and right side of (2), we get y 2 = ( λ ( x x p ) + y p ) 2 . Then replacing y 2 of (1) with ( λ ( x x p ) + y p ) 2 , we have
x 3 λ 2 x 2 + ( a 2 y p λ + 2 x p λ 2 ) x + ( b ( y p λ x p ) 2 ) = 0
Because P, Q, and S are three points on the curve, meaning that x p , x q , and x s are three roots of (3), based on Vieta’s formulas, we have
( x x p ) ( x x q ) ( x x s ) = 0
Expanding (4) gives:
x 3 ( x p + x q + x s ) x 2 + ( x p x q + x q x s + x s x p ) x x p x q x s = 0
Then from (3) and (5) we have x p + x q + x s = λ 2 . That is, x s = λ 2 x p x q and y s = λ ( x s x p ) + y p . Considering the L 1 line slope λ = ( y q y p ) / ( x q x p ) , x r = x s , and y r = y s , we summarize the formulas for point addition R = P + Q = ( x r , y r ) on elliptic curve y 2 = x 3 + a x + b as follows.
λ = y q y p x q x p
x r = λ 2 x p x q
y r = λ ( x p x r ) y p
For Q = P , the line through P and P does not intersect the elliptic curve at the third point. For this reason, the point O at infinity is included in the group of elliptic curves and defined as P + ( P ) = O . By this definition, P + O = P .
In practice, a group of elliptic curves over a finite field of F m or F 2 n is used, where F m contains numbers from 0 to m 1 and F 2 n uses n-bit binary numbers. In the case of F m , the results of all the above calculations are modularized by m, where m is usually a prime number. For example, Secp256k1 [4] elliptic curve used in Bitcoin uses a 256-bit m = 2 256 2 32 2 9 2 8 2 7 2 6 2 4 1 . Secp256k1 defines y 2 = x 3 + a x + b = x 3 + 7 and gives a point P = [ x , y ] on the elliptic curve as follows.
a = 0x0000000000000000000000000000000000000000000000000000000000000000
b = 0x0000000000000000000000000000000000000000000000000000000000000007
m = 0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffefffffc2f
x = 0x79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798
y = 0x483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8
By considering P + O = P , we give the point addition R = P + Q algorithm over the finite field of F m in Algorithm 1. In our implementation, O is denoted as [ 1 , 1 ] . In the case of Q = P , we perform the point doubling R = 2 P (line 6 in the algorithm).
Preprints 83688 i004
An example of point addition R = P + Q on the Secp256k1 curve is shown below where [ P x , P y ] = P , [ Q x , Q y ] = Q , and [ R x , R y ] = R in affine coordinates.
Px = 0xe493dbf1c10d80f3581e4904930b1404cc6c13900ee0758474fa94abe8c4cd13
Py = 0x51ed993ea0d455b75642e2098ea51448d967ae33bfbdfe40cfe97bdc47739922
Qx = 0x79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798
Qy = 0x483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8
Rx = 0x2f8bde4d1a07209355b4a7250a5c5128e88b84bddc619ab7cba8d569b240efe4
Ry = 0xd8ac222636e5e3d6d4dba9dda6c9c426f788271bab0d6840dca87d3aa6ac62d6

2.1.2. ECC Point Doubling in Affine Coordinates

Figure 3 shows the point doubling R = 2 P on an elliptic curve y 2 = x 3 + a x + b . Compared to the point addition shown in Figure 2, here we have Q = P and R = P + Q = 2 P . Given a point P = [ x p , y p ] on the curve, if the tangent line L 1 through P intersects the curve in S = [ x s , y s ] , then R = [ x r , y r ] = 2 P is defined as x r = x s and y r = y s .
Below we show how to get formulas to calculate x r and y r based on x p , y p , and a for y 2 = x 3 + a x + b . The formula of the line L 1 through P is
y = λ ( x x p ) + y p
where λ is the slope of the line L 1 . Squaring both the left side and right side of (9), we get y 2 = ( λ ( x x p ) + y p ) 2 . Then replacing y 2 of (1) with ( λ ( x x p ) + y p ) 2 , we have
x 3 λ 2 x 2 + ( a 2 y p λ + 2 x p λ 2 ) x + ( b ( y p λ x p ) 2 ) = 0
Because P, Q ( = P ), and S are three points on the curve, meaning that x p , x q , and x s are three roots of (10), based on Vieta’s formulas, we have
( x x p ) 2 ( x x s ) = 0
Expanding (11) gives:
x 3 ( 2 x p + x s ) x 2 + ( x p 2 + 2 x p x s ) x x p 2 x s = 0
Then from (10) and (12) we have 2 x p + x s = λ 2 . That is, x s = λ 2 2 x p and y s = λ ( x s x p ) + y p . The slope of tangent line L 1 of y 2 = x 3 + a x + b at P can be obtained as follows.
d d x ( y 2 ) = d d x ( x 3 + a x + b )
2 y d y d x = 3 x 2 + a
d y d x = ( 3 x 2 + a ) / ( 2 y )
λ = ( 3 x p 2 + a ) / ( 2 y p ) a t P
Considering x r = x s and y r = y s , we summarize the formulas for point doubling R = 2 P on elliptic curve y 2 = x 3 + a x + b as follows.
λ = 3 x p 2 + a 2 y p
x r = λ 2 2 x p
y r = λ ( x p x r ) y p
We give the point doubling R = 2 P algorithm over the finite field of F m in Algorithm 2. For P y = 0 , the tangent at P is vertical and does not intersect the elliptic curve at any other point. By definition, 2 P = O for such a point P (line 2 in the algorithm).
Preprints 83688 i005
An example of point doubling R = 2 P on the Secp256k1 curve is shown below where [ P x , P y ] = P , and [ R x , R y ] = R in affine coordinates.
Px = 0xb91dc87409c8a6b81e8d1be7f5fc86015cfa42f717d31a27d466bd042e29828d
Py = 0xc35b462fb20bec262308f9d785877752e63d5a68e563e898b4f82f47594680fc
Rx = 0x2d4fca9e0dff8dec3476a677d555896a0980ebccc6bc595a23675496dcc33bb5
Ry = 0xcce413eee9496094256e446b22fd234c03d9258330d77fc8b0d318a6aedba8cb

2.2. ECC Point Addition and Doubling in Projective Coordinates

Point addition and point doubling in affine coordinates require modular inversion (division) to calculate the line slope λ . We can eliminate the expensive modular inversion during calculations by using projective coordinates. In projective coordinates, a point is defined as P = [ X , Y , Z ] . Initially, we can convert a point [ X , Y ] in affine coordinates to a point in projective coordinates by P = [ X , Y , 1 ] . Then we calculate R = [ X r , Y r , Z r ] in projective coordinates using formulas that do not contain division. At the very final step, we can get the point [ x r , y r ] in affine coordinates with the transformation of x r = X r / Z r and y r = Y r / Z r , which require divisions.
The formulas for ECC point addition and doubling in projective coordinates can be derived based on the formulas in affine coordinates. A point P in projective coordinates is represented by the triple P = [ X , Y , Z ] , corresponding to the point [ x p , y p ] = [ X / Z , Y / Z ] in affine coordinates. That is, x p = X / Z and y p = Y / Z . We derive the point doubling formulas for R = [ X r , Y r , Z r ] = 2 P in projective coordinates as follows.
  From (17), we have
  λ = 3 x p 2 + a 2 y p = 3 ( X / Z ) 2 + a 2 ( Y / Z ) = 3 X 2 / Z 2 + a 2 Y / Z = 3 X 2 + a Z 2 2 Y Z
  From (18), we have
  x r = λ 2 2 x p = ( 3 X 2 + a Z 2 2 Y Z ) 2 2 X / Z = ( 3 X 2 + a Z 2 ) 2 ( 2 Y Z ) 2 8 X Y 2 Z ( 2 Y Z ) 2
  Let Z r = ( 2 Y Z ) 3 . Because x r = X r / Z r = X r / ( 2 Y Z ) 3 , i.e., X r = ( 2 Y Z ) 3 x r , then
  X r = 2 Y Z ( ( 3 X 2 + a Z 2 ) 2 8 X Y 2 Z )
  From (19), we have
y r = λ ( x p x r ) y p = 3 X 2 + a Z 2 2 Y Z ( X / Z ( 3 X 2 + a Z 2 ) 2 8 X Y 2 Z ( 2 Y Z ) 2 ) Y / Z
  = ( 3 X 2 + a Z 2 ) ( 4 X Y 2 Z ( ( 3 X 2 + a Z 2 ) 2 8 X Y 2 Z ) ) ( 2 Y Z ) 3 8 Y 4 Z 2 ( 2 Y Z ) 3
  Because y r = Y r / Z r = Y r / ( 2 Y Z ) 3 , i.e., Y r = ( 2 Y Z ) 3 y r , then
  Y r = ( 3 X 2 + a Z 2 ) ( 4 X Y 2 Z ( ( 3 X 2 + a Z 2 ) 2 8 X Y 2 Z ) ) 8 Y 4 Z 2
  Given P = [ X , Y , Z ] , the formulas for point doubling R = 2 P in projective coordinates are summarized below.
X r = 2 Y Z ( ( 3 X 2 + a Z 2 ) 2 8 X Y 2 Z )
Y r = ( 3 X 2 + a Z 2 ) ( 4 X Y 2 Z ( ( 3 X 2 + a Z 2 ) 2 8 X Y 2 Z ) ) 8 Y 4 Z 2
Z r = ( 2 Y Z ) 3
We can use the similar method to derive the formulas for point addition in projective coordinates. The derivation is omitted here but the calculations are shown in the following algorithm. It can be seen that the calculations require many more multiplications. Algorithm 3 gives the algorithm for point addition in projective coordinates. And Algorithm 4 gives the algorithm for point doubling in projective coordinates.
Preprints 83688 i006
Preprints 83688 i007
An example of point addition R = P + Q on the Secp256k1 curve is shown below where [ P x , P y , P z ] = P , [ Q x , Q y , Q z ] = Q , and [ R x , R y , R z ] = R in projective coordinates.
Px = 0x61bac660b055382e5906bd6e56e316542194b799b7bcf5ad05ee2171fd81735a
Py = 0xbe44ac0a2b712ccb6bb3ea933e4db0a4213c139078aef594cf8c2c5c2924d54d
Pz = 0x2b6da6fb02877584dc4d5111c88783772d7be5ac2866cce3707d53913384bf49
Qx = 0x6789c1137724f1f3f585337a1814eebc23ea329a0390fd9b1b9ece7af3e71ce1
Qy = 0xc9563c5035ccafec8673f56185141f720073ab3063bb417bf0e70e9d9128c232
Qz = 0x58990cd022b711912676c0451bdab6be04a06c1871b0139214bdbe81fd965555
Rx = 0xf4e9cb9ba9c18876b7b0ad000ce921b35e23139456f4f6c3f70e2fea149500a0
Ry = 0xc06176a9221b6d8b49a22130fb934b21358a1775df68d93ec308aca3ece072b5
Rz = 0x878fb153f0690416ba0ee136ec663debf8472f3ee92d350f9b3a42b4fd53fb27
An example of point doubling R = 2 P on the Secp256k1 curve is shown below where [ P x , P y , P z ] = P and [ R x , R y , R z ] = R in projective coordinates.
Px = 0x24fd537e9a5125438a02848f6b74725f678723f5c1450b8fb82a68f0c88c9764
Py = 0xe42e83a1d3d7c2241535b5c0ba5f2462c24bd87aaf9f15b05f3775d168b9bf6c
Pz = 0xe00794d20b32e0e94472c36b89cf5e5d6ec769b53dd6c1422e9467090c272305
Rx = 0x24d00c48ac8bbe61ceb0ac5daf5defd913af9220a07650642a3a41cad9030ee6
Ry = 0x75d429714ea6ce1ab3811d9adc16961a219e2812210fa8465042c18ecd5a0de6
Rz = 0x644a5a2964435364b74d7fa79fe0f06a5b1d2782e7f7b8d1e835db6d6b8786bc

2.3. ECC Point Addition and Doubling in Jacobian Coordinates

A point P in Jacobian coordinates is represented by the triple P = [ X , Y , Z ] , corresponding to the point [ x p , y p ] = [ X / Z 2 , Y / Z 3 ]  [2] (Page 424) in affine coordinates. That is, x p = X / Z 2 and y p = Y / Z 3 . We derive the point doubling formulas for R = [ X r , Y r , Z r ] = 2 P in Jacobian coordinates as follows.
  From (17), we have
  λ = 3 x p 2 + a 2 y p = 3 ( X / Z 2 ) 2 + a 2 ( Y / Z 3 ) = 3 X 2 / Z 4 + a 2 Y / Z 3 = 3 X 2 + a Z 4 2 Y Z
  From (18), we have
  x r = λ 2 2 x p = ( 3 X 2 + a Z 4 2 Y Z ) 2 2 X / Z 2 = ( 3 X 2 + a Z 4 ) 2 ( 2 Y Z ) 2 8 X Y 2 ( 2 Y Z ) 2
  Let Z r = 2 Y Z . Because x r = X r / Z r 2 = X r / ( 2 Y Z ) 2 , we have
  X r = ( 3 X 2 + a Z 4 ) 2 8 X Y 2
  From (19), we have
  y r = λ ( x p x r ) y p = 3 X 2 + a Z 4 2 Y Z ( X / Z 2 X r / ( 2 Y Z ) 2 ) Y / Z 3 =
  ( 3 X 2 + a Z 4 ) ( 4 X Y 2 X r ) ( 2 Y Z ) 3 8 Y 4 ( 2 Y Z ) 3
  Because y r = Y r / Z r 3 = Y r / ( 2 Y Z ) 3 , we have
  Y r = ( 3 X 2 + a Z 4 ) ( 4 X Y 2 X r ) 8 Y 4
  Given P = [ X , Y , Z ] , the formulas for point doubling R = 2 P in Jacobian coordinates are summarized below.
X r = ( 3 X 2 + a Z 4 ) 2 8 X Y 2
Y r = ( 3 X 2 + a Z 4 ) ( 4 X Y 2 X r ) 8 Y 4
Z r = 2 Y Z
  An example of point doubling R = 2 P on the Secp256k1 curve is shown below where [ P x , P y , P z ] = P and [ R x , R y , R z ] = R in Jacobian coordinates.
Px = 0xe43306185ef298127aef469d577aed78acafaddfc28ad0857491c38ffbedc475
Py = 0x4d83871239769596f65c180546c170a28cffca37bf6393025c457f406f54c517
Pz = 0xdafa620812722dceda7d93a91158dadbe11fee894e71eafa054d5f5fd274377e
Rx = 0x7d7cd6974d7e127a5fdf3f3c9c9eb5dcd9c15e033794466de63bcf2b9548ff85
Ry = 0x8f967b514296945a6dd052bca59ec1418a35cde3c6dd7b269d2e71daa80f851e
Rz = 0xcf89daf8b6c736cc851882b7e85c8ea8f703a9323a3d627909582b7904766035
Based on formulas (6), (7), and (8), and x = X / Z 2 and y = Y / Z 3 , the formulas for point addition in Jacobian coordinates can be derived which are omitted here. An example of point addition R = P + Q on the Secp256k1 curve is shown below where [ P x , P y , P z ] = P , [ Q x , Q y , Q z ] = Q , and [ R x , R y , R z ] = R in Jacobian coordinates.
Px = 0xb7bae589ec8a8c722c1ffb2c37fd4bbeda59074675c3eb50f1673ed46bbedfbe
Py = 0x81dee3398bdd718591c10762f61a0e41c4d609dffddcbeeb3894b8c4ce75e027
Pz = 0x51ec57b21350ad3d3466be5a7d28742279fbac1146fb4143767ee368a7dc741e
Qx = 0x0aa20a04dba4788e9b99e10f2e9f4d43b7f53916a5cacf2050dc70bc34c18d21
Qy = 0x60f318d01180b303f4b20a49c2b7e2b498405f88bd423a9a7cb92bab5f1b6abf
Qz = 0x3e1dcb6efa88b113f40b5858ea8c3cb5ddae2277c4683af9487e27023cba690d
Rx = 0x348248c47ad5d3186bd807c382659263840ba7ea13e61128d24337db9b0e5278
Ry = 0xb38573698dd6fef9ec93a9d68ae0a997191b678474cc00a13961defa3ed763e9
Rz = 0x76da2008046aae9901e6b96a7b54c42fd480de5cbc8cef6c0d0a9b3d7086f0f4
The point [ x r , y r ] in affine coordinates can be transformed from Jacobian coordinates by x r = X r / Z r 2 and y r = Y r / Z r 3 . This is only performed one time at the final step. It can be done with the modular inversion which we will introduce next.

2.4. Modular Inversion

In affine coordinates, the point addition and point doubling algorithms must compute the slope λ of a line. Projective and Jacobian coordinates require the point to be transformed into affine coordinates at the final step to obtain the shared secret key. These calculations or transformations require division and modulo operations.
Generally, the modular inversion calculates c = b a 1 mod   m , where m is an n-bit odd number and { a , b } < m . An example of modular inversion is shown below.
b = 0x9cfa1c993911914be0f15bd74a878abe0079c6254b961b82e1abda76387d1d85
a = 0xd5076ae274e874c2eb0f7778717c39460236549ddd9fc651e68a0c0e787b4ce8
m = 0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffefffffc2f
c = 0xe8e5ac2e1d3358894ce1b3342737b38c39b89059dd55d3c4741626de8270228e
Algorithm 2.22 in [3] gives an algorithm to calculate a 1 mod   m using the extended Euclidean algorithm. Based on it, we give a Verilog HDL implementation of the modular inversion that calculates c = b a 1 mod   m in Appendix B. Figure B1 shows the simulation waveform generated with ModelSim.
In affine coordinates, every point addition or point doubling invokes modular inversion to calculate the line slope λ . In projective or Jacobian coordinates, this modular inversion can be removed during point addition or point doubling calculations, but a final division is required to transform a point from projective or Jacobian to affine coordinates to obtain the shared secret key. This division can be achieved using modular inversion.

2.5. Scalar Point Multiplication

Scalar point multiplication performs Q = d P where P and Q are elliptic curve points and d = d n 1 d 1 d 0 is an n-bit scalar. Scalar point multiplication is also called point multiplication, it can be done with “double-and-add” method [3]. Algorithm 5 formally gives the algorithm for scalar point multiplication. The algorithm invokes point addition and point doubling. Point doubling can be calculated simultaneously with point addition.
Preprints 83688 i008
Below we give a small example to show the calculation steps of the scalar point multiplication. For a 4-bit d = 1101 2 = 13 , we calculate Q = d P in 4 steps to get Q = 13 P .
Weight Point addition Point doubling
Initial Q = O R = P
d 0 = 1 1 Q = Q + R = O + P = P R = 2 R = 2 P
d 1 = 0 2 R = 2 R = 4 P
d 2 = 1 4 Q = Q + R = P + 4 P = 5 P R = 2 R = 8 P
d 3 = 1 8 Q = Q + R = 5 P + 8 P = 13 P R = 2 R = 16 P

2.6. Elliptic Curve Diffie-Hellman Key Exchange

Elliptic curve Diffie-Hellman (ECDH) is a variant of the Diffie-Hellman key agreement protocol using ECC between two parties to establish a shared secret key over an insecure network [4,5]. Then this shared secret key can be used by the two parties to encrypt and decrypt subsequent communications using a fast symmetric-key cryptography over the insecure network.
The ECDH key exchange protocol is shown in Table 1. Because Q a b = d a Q b = d a d b P , Q b a = d b Q a = d b d a P , and d a d b = d b d a , we have Q b a = Q a b . Below shows an ECDH key exchange example using Secp256k1. We can see that two parties have a same shared secret key (Qabx = Qbax).
da   = 0x650aa7095daeaa37ab9051541f0ce304f8969a6d88bb3bebb4fe680fca9a2595
db   = 0xedc68f194c4e30d6ef90467df822b00e5ef122dea48c9d1c54817080d1a341f4
Qax  = 0x167d2537aa6bbd8d978b58be0f9466520b7b184e205ff96a9ff567b35b32c7b7
Qay  = 0xde3961553d36551f92726fee0e332133960edddccd2784b98b2af730d2fc6e14
Qbx  = 0x839da64a414c2243a5526230603109be9c615613a9e98c3d650bb0488580bbda
Qby  = 0x96e88e99304a5afcdd77c4f3b3327a28162627ebe08194baa0c78dfb67a11042
Qabx = 0x1f254c7da15899275cdcab9d992f58251a4ab630fe9864d20cf317ab57749947
Qaby = 0xd6cb400b3c49d33d3df28f9d34fa09f8b6c8edf117a378c5a45d0a51e6c0debc
Qbax = 0x1f254c7da15899275cdcab9d992f58251a4ab630fe9864d20cf317ab57749947
Qbay = 0xd6cb400b3c49d33d3df28f9d34fa09f8b6c8edf117a378c5a45d0a51e6c0debc

3. Modular Multiplication Algorithms

Point addition and point doubling use many modular multiplications. This section describes interleaved modular multiplication (IMM), Montgomery modular multiplication (MMM), and shift-sub modular multiplication (SSMM) algorithms, and proposes shift-sub modular multiplication with advance preparation (SSMMPRE) and shift-sub modular multiplication with CSAs and sign detection (SSMMCSA) algorithms.

3.1. Interleaved Modular Multiplication Algorithm

The IMM algorithm [6] is formally given in Algorithm 6. It computes p = a b mod   m where R = 2 n , a , b < m < R , and m is an n-bit odd number. That is, the ( n 1 )th bit and 0th bit of n-bit m are 1. IMM begins with checking the ( n 1 )th bit of multiplier b. Therefore, in each iteration, the product p is shifted to the left by one bit (line 3). Because a , b < m , 2 p + a < 3 m (lines 3 and 4). Therefore, it is enough to subtract 2 m from 2 p + a (lines 5 and 6), ensuring p < m .
Preprints 83688 i009
Figure 4 shows a possible hardware implementation of Algorithm 6. Red squares are registers, others are combinational circuits. Clearly, the critical path is the right part that computes the new p in each iteration. It consists of three carry-propagate adders (CPAs) and three multiplexers. The most significant bit of the adder output (sign) can be used as the select signal of multiplexers. Note that 2 p can be realized by suitable wiring.

3.2. Montgomery Modular Multiplication Algorithm

The MMM algorithm [7] performs modular multiplication in Montgomery domain. It is very efficient for modular exponentiation used by RSA cryptography. MMM calculates p = a b R 1 mod   m , where R = 2 n , a , b < m < R , and m is an n-bit odd number with m n 1 = m 0 = 1 . It performs reduction R 1 during the multiplication a b because a and b are represented in Montgomery domain as follows.
a = a R mod   m
b = b R mod   m
where a and b are the operands in the conventional domain. Such a reduction ensures that the product p is an operand still in Montgomery domain:
p = a b R 1 mod   m = a R b R R 1 mod   m = a b R 2 R 1 mod   m = a b R mod   m
MMM is widely used in RSA cryptography where the modular exponentiation is realized with the repeated modular multiplications. MMM algorithm is formally given in Algorithm 7. We perform R 1 = 1 / 2 n = i = 0 n 1 1 / 2 in n iterations and divide p by 2 in each iteration. Line 3 performs multiplication. Line 4 makes p even for the reduction where p 0 is the least significant bit of p. Line 5 shifts p to right by one bit ( p / 2 ). Line 6 ensures p < m . The reason is shown below. Suppose p = a after the 0th iteration. For i 1 , p = p + a < 2 m , p = p + m < 3 m , p = p / 2 < 2 m , p m < m if p > = m .
Preprints 83688 i010
Figure 5 shows a possible hardware implementation of Algorithm 7. Red squares are registers, others are combinational circuits. Clearly, the critical path is the right part that computes the new p in each iteration. It consists of three CPAs and three multiplexers. The most significant bit of the adder output (sign) or the least significant bit p 0 of p can be used as the select signal of multiplexers. Note that p / 2 can be realized by suitable wiring.

3.3. Shift-Sub Modular Multiplication Algorithm

The SSMM algorithm [8,9] is given in Algorithm 8. It begins with checking the 0th bit of multiplier b to calculate p = a b mod m . Therefore, in each iteration, the multiplicand u ( = a ) is shifted to the left by one bit (line 5). Because a , b < m , p + u < 2 m (line 3). Therefore, it is enough to subtract m from p + u (line 4), ensuring p < m . Because a , b < m , 2 u < 2 m (line 5). Therefore, it is enough to subtract m from 2 u (line 6), ensuring u < m .
Preprints 83688 i011
Figure 6 shows a possible hardware implementation of Algorithm 8. Red squares are registers, others are combinational circuits. Compared to Figure 4 and Figure 5, here we reduce critical path to two CPAs and two multiplexers. The bit b i (b[cnt] in the figure) is used as a selection signal for the last (bottom) multiplexer. If it is a 0, the value in register p is selected (unchanged). The register u and its corresponding combinational circuits are added for performing lines 5 and 6 in Algorithm 8. Note that the implementation is slightly different from the algorithm. Regardless of b i , we calculate p + u and p + u m first. If p + u m is non-negative, select it, otherwise select p + u . Finally, the value selected by b i is written to register p.
Note that instead of using a multiplexer in the bottom, the bit b i can be used as a write enable for register p. That is, if b i = 0 , we just keep the content of register p unchanged.

3.4. Shift-Sub Modular Multiplication with Advance Preparation Algorithm

From the implementation of the SSMM algorithm, we can see that the critical path is the computation of p + u m = p + ( u m ) . m is a modulus and does not change during the multiplication. u is the multiplicand and doubles with each iteration. Then we can prepare u m in advance in the previous iteration. u is generated as follows. If 2 u m is non-negative, 2 u m is written to register u; otherwise, 2 u is written to register u, where u is the value of u in the previous iteration. In the current iteration, we have two cases for u m . Case 1: u m = 2 u m if 2 u m is negative. Case 2: u m = 2 u m m = 2 u 2 m if 2 u m is non-negative. Then we just prepare x = 2 u m and y = 2 u 2 m , and store them in registers. That is, if x < 0 , p + ( u m ) = p + x ; otherwise p + ( u m ) = p + y . The algorithm SSMMPRE is formally given in Algorithm 9.
Preprints 83688 i012
Figure 7 shows the hardware implementation of Algorithm 9. Registers x and y hold 2 u m and 2 u 2 m , respectively, where u is the value of u in the previous iteration.

3.5. Shift-Sub Modular Multiplication with CSAs and Sign Detection Algorithm

The algorithm SSMMCSA (SSMM with CSAs and sign detection) is formally given in Algorithm 10. Figure 8 shows the implementation of SSMMCSA. The two CPAs in Figure 6 are replaced with CSAs. Through our hardware implementation, we find that the part of CSAs that generates c and s is no longer the critical path. We use another register q to store c + s calculated with a CPA, and generate p from q. The critical path of this circuit contains a CPA and a multiplexer. The output of CSAs consists of carry c and sum s. The result value p will be ( c + s ) mod m . We will later propose an easy way to determine the sign of c + s from c and s without using CPA to calculate c + s . The upper CSA performs ( g , h ) C S A ( c , s , u ) and the other CSA performs ( x , y ) C S A ( g , h , m ) , corresponding to p + u and p + u m in Figure 6. One CSA’s output will be selected with the “csasign” (CSA’s sign) signal and stored to the CS register.
Preprints 83688 i013
Note that we can get m from m quickly. Usually m = m ¯ + 1 where + 1 needs a CPA. But here m is an n-bit odd value ( m [ 0 ] = 1 ), so that we can invert the left n 1 bits of m and leave the least-significant-bit unchanged, that is, m = { m [ n 1 : 1 ] ¯ , 1 ’b 1 } .
Now we describe how to generate the signal of “csasign”. Figure 9 shows an example of determining the sign of C + S from 17-bit C and S for a 16-bit a b mod m when using CSAs. We divide the 17 bits to 4 windows, namely W 3 , W 2 , W 1 , and W 0 . W 3 has 5 bits and each of W 2 , W 1 , and W 0 has 4 bits. We use one 5-bit and three 4-bit CPAs, carry-lookahead adders (CLAs) for instance, to perform additions in 4 windows simultaneously as follows. Note that each of adders generates a 5-bit result.
W 3 [ 4 : 0 ] = C [ 16 : 12 ] + S [ 16 : 12 ]   (Add 5-bit, 5-bit sum)
W 2 [ 4 : 0 ] = C [ 11 : 8 ] + S [ 11 : 8 ]   (Add 4-bit, 5-bit sum)
W 1 [ 4 : 0 ] = C [ 7 : 4 ] + S [ 7 : 4 ]   (Add 4-bit, 5-bit sum)
W 0 [ 4 : 0 ] = C [ 3 : 0 ] + S [ 3 : 0 ]   (Add 4-bit, 5-bit sum)
We summarize the three cases shown in Figure 9 as follows. The signal “sign_inverse” is true means that the original sign ( W 3 [ 4 ] ) is inverted when W 3 [ 3 : 0 ] = = 4 ’b1111.
sign_inverse = ( W 2 [ 4 ] = = 1 )    Case 2
| ( W 2 = = 5 ’b 01111 ) & ( W 1 [ 4 ] = = 1 )    Case 1
| ( W 2 = = 5 ’b 01111 ) & ( W 1 = = 5 ’b 01111 ) & ( W 0 [ 4 ] = = 1 )    Case 0
  The sign bit is a 1 if P = C + S is negative; a 0 otherwise. When W 3 [ 4 ] is a 0 and “sign_inverse” is true, the sign will be 1 at the condition of W 3 [ 3 : 0 ] = = 4 ’b1111. Similarly, when W 3 [ 4 ] is a 1 and “sign_inverse” is true, the sign will be 0 at the condition of W 3 [ 3 : 0 ] = = 4 ’b1111. Thus we have the following expression for determining the “csasign”.
csasign = W 3 [ 4 ] ( sign _ inverse & ( W 3 [ 3 : 0 ] = = 4 b 1111 ) )
  The symbol ⊕ denotes an eXclusive OR (XOR) operation. When W 3 [ 3 : 0 ] 4 ’b1111, the sign equals W 3 [ 4 ] , regardless any case for W 2 , W 1 , or W 0 . Note that W 2 , W 1 , and W 0 are sums of two 4-bit values c and s, so they do not have the pattern 5’b11111. The maximum sum is 5’b11110 when both c and s have a maximum value of 4’b1111.
In the example described above, the window size is 4. We can let the window size be 2 or 8. Then we can use eight 2-bit adders or two 8-bit adders to determine the sign of P = C + S . The former has a shorter latency introduced by adders, but the logic equation for determining the sign is more complicated. On the other hand, the latter has a simpler logic equation but the adder has a longer latency to determine the sign.
For a 256-bit ECC, there are more options for the number of adder bits and the number of windows. In order to design a low-cost and high-performance circuit, we instigate the cost (the number of ALMs) and performance (the circuit frequency) of the sign detection circuit for using CSAs in 256-bit ECC on FPGA (Field-programmable gate array) chip. There are 7 configurations. The experimental results are shown in Table 2 and Figure 10.
In general, decreasing CPA bits increases frequency. However, in our simulations, the frequency of the last configuration (128 windows and 2-bit adder) is lower than the configuration with 64 windows and 4-bit adder. This is because the logic equation for sign detection becomes complicated. We can also see that larger adders require more ALMs.
Based on the experimental results, our 256-bit ECC implementations use 4-bit adders, so there are 64 windows ( 4 × 64 = 256 ). A frequency of 392.46 MHz is measured when the FPGA chip implements only the sign detection circuitry. Implementing both the sign detection circuit and the CSAs results in a frequency of 343.76 MHz, lower than 392.46 MHz. The latency of the CSAs is the same as that of a 1-bit full adder. Clearly, it is less than that of the sign detection circuit. These experimental results imply that the frequency decreases as the circuit gets larger.
Note that the sign detection circuit itself is a combinational circuit. If we want to test its latency we can add registers on both the input and output sides. These registers are for testing purposes only and should be removed for ECC implementations. Otherwise, there will be a delay of two clock cycles, resulting in incorrect timing.

4. Hardware Implementations of Modular Multiplications and ECC

We have implemented modular multiplication algorithms and ECC in Verilog HDL. Unlike sequential program code in software, hardware modules can operate simultaneously. Considering data dependency, we must handle the synchronization between hardware modules using signals such as “start” and “ready”. This section describes these implementations and evaluates their cost and performance.

4.1. Hardware Implementations of Modular Multiplications

We have implemented the five modular multiplication algorithms described in the previous section. Since we use the 256-bit key proposed in Secp256k1 [4], the Verilog HDL code for modular multiplication also uses 256-bits. As an example of Verilog HDL code shown in Appendix A, the circuit calculates p = a b mod m , where m is a 256-bit odd number and { a , b } < m , as described in Algorithm 8 (SSMM). “start” is a one clock cycle active signal that tells the module to start modular multiplication. “ready”, “busy”, and “ready0” are synchronization signals with other modules. The simulation waveform generated with ModelSim can be found in Figure A1. It is easy to develop Verilog HDL code for other modular multiplication algorithms by referring to the SSMM example code.
Table 3 gives the cost-performance of the five modular multiplication algorithms. The column of clock cycles shows the required number of clock cycles when executing the modular multiplication algorithm. The column of frequency (MHz) shows the frequency in MHz at which the circuit can work. The column of latency ( μ s) shows the time in microsecond, that is calculated by dividing the clock cycles by the clock frequency. The column of ALMs shows the required number of adaptive logic modules. And the column of registers shows the required number of flip-flops.
Figure 11 plots the relative cost-performance of the modular multiplication algorithms to that of IMM. From this figure and table, we can see that SSMMCSA provides the highest frequency and lowest latency at the highest hardware cost.

4.2. Hardware Implementations of ECC

We have implemented ECC in affine, projective, and Jacobian coordinates using the IMM, SSMM, SSMMPRE, and SSMMCSA modular multiplication algorithms.
MMM is very efficient for modular exponentiation as the basic operation for RSA cryptography because it does not require modular (division). The modular exponentiation can be calculated by repeatedly calling MMM. However, transformations to the Montgomery domain are required before the calculations, and a transformation back to the normal domain is also required to obtain the final result. Domain transformation requires the value q = R 2 mod m which is calculated by costly modular operations. ECC in projective or Jacobian coordinates also does not require modular (division) during scalar point multiplication. Getting the key requires a final division to convert the point to affine coordinates. For these reasons, ECC using MMM is not implemented in this work.
Referring to Algorithm 5, in affine coordinates, the module of scalar point multiplication (scalarmult) invokes the module of point addition (addpoints) and the module of point doubling (doublepoint), as follows (simplified).
  module scalarmult (clk, rst_n, start, x, y, d, m, a, rx, ry, ready);
addpointsap(clk,rst_n,start_ap,x1,y1,x2,y2,m,a,apx,apy,ready_ap);
doublepointdp(clk,rst_n,start_dp,x1,y1,m,a,dpx,dpy,ready_dp);
endmodule
  Based on the start signal of module scalarmult, we can generate the start signals of start_ap for addpoints and start_dp for doublepoint. An important synchronization is to allow addpoints to start only after the previous iteration’s doublepoint has finished. After 256 iterations, the addition point [apx, apy] is the result point [rx, ry]. The modules of scalar point multiplication in projective and Jacobian coordinates invoke the point addition and doubling modules in projective and Jacobian coordinates, respectively.
In addpoints and doublepoint modules, the result point is calculated based on the algorithms of point addition and point doubling, as described in Algorithm 1 (or 3) and Algorithm 2 (or 4). These modules invoke
  (1) modadd (modular addition),
(2) modsub (modular subtraction),
(3) modmul (modular multiplication), and
(4) modinv (modular inversion).
  The first two modules are combinational circuits which take one clock cycle. The last two modules are sequential circuits for which we must generate the start signals. The Verilog HDL source codes of these two modules are given in the appendices Appendix A and Appendix B.
Referring to Table 1, in affine coordinates, x of the Q a b and Q b a can be used as the shared secret key for two parties. In projective and Jacobian coordinates, x must be calculated once by X / Z and X / Z 2 respectively. Such calculations can be performed using modmul and modinv as shown in the appendices Appendix A and Appendix B. Table 4 gives the cost-performance of the ECC in affine, projective, and Jacobian coordinates using the IMM, SSMM, SSMMPRE, and SSMMCSA modular multiplication algorithms.
The relative cost-performance of the ECC in affine, projective, and Jacobian coordinates are plotted in the Figure 12, Figure 13 and Figure 14, where the cost-performance of IMM is set to 1.0.
As shown in Table 3, the circuit SSMMCSA has the highest frequency, but applying it to the ECC design does not achieve a higher frequency than the other circuits, such as SSMM. This is because, in ECC design, there are other modules that have longer latency than CSAs and sign detection. For example, modadd takes one clock cycle to perform an addition on two operands a and b, a subtraction (subtracting the modulus m from the sum), and a selection using a multiplexer based on the sign of the subtraction result. Such calculations take longer time than the CSAs and sign detection operations. Meanwhile, SSMMCSA circuit is larger than SSMM, this also decreases the frequency.
Figure 15 shows the relative latency to IMM for ECC implementations. All SSMM-related implementations have lower latency than IMM. We can see that the SSMM in projective coordinates has the lowest latency. This is because, SSMMPRE and SSMMCSA use a lot of hardware resources, which reduces the frequency. Interestingly, the latency in projective coordinates is lower than that in Jacobian coordinates. The formulas for point addition and point doubling in Jacobian coordinates look simpler than those in projective coordinates, but from Formula 24, the calculation of Y r depends on X r , resulting in a sequential execution. On the other hand, Y r and X r in projective coordinates can be calculated in parallel (Formula 21 and Formula 20).
The ECC implementations presented in this paper are based on the Secp256k1 curve. It is easy to use the NIST Secp256r1 (P-256) curve [4,11]. Our experimental results show that the cost performance of ECC implementations based on both curves is almost the same.

5. Concluding Remarks

This paper introduced ECC modular multiplication algorithms and their hardware implementations. Experimental results show that SSMMCSA (Shift-sub modular multiplication with CSAs and sign detection) has lower latency than SSMM (Shift-sub modular multiplication). We also investigated ECC implementations in affine, projective, and Jacobian coordinates using the IMM (Interleaved modular multiplication), SSMM, SSMMPRE (Shift-sub modular multiplication with advance preparation) , and SSMMCSA algorithms. Experimental results show that ECC in projective coordinates with SSMM has the lowest latency among all implementations.
The SSMMCSA circuit itself has a higher frequency than SSMM, but the ECC circuits using SSMMCSA have the same or even lower frequency than that using SSMM. The next challenge is to make the SSMMCSA ECC circuit faster than the SSMM ECC circuit.

Appendix A. Verilog HDL Code of Shift-Sub Modular Multiplication (SSMM)

Preprints 83688 i001
This code implements Algorithm 8. Its block diagram is shown in Figure 6. The signal “start” is active for one clock cycle to tell the module to start modular multiplication. The signal “ready” remains active for one clock cycle to indicate that the modular multiplication result is available. “ready0” remains active until the multiplier is cleared. This signal is used to confirm the readiness of multiple modules to initiate another module’s operation. Figure A1 shows the simulation waveform generated with ModelSim.
Figure A1. Waveform of SSMM that calculates p = a b mod m . The Verilog HDL code is given in Appendix A.
Figure A1. Waveform of SSMM that calculates p = a b mod m . The Verilog HDL code is given in Appendix A.
Preprints 83688 g0a1

Appendix B. Verilog HDL Code of Modular Inversion

Preprints 83688 i002
This code is developed based on Algorithm 2.22 in [3]. The signal “start” is active for one clock cycle to tell the module to start modular inversion. The signal “ready” remains active for one clock cycle to indicate that the modular inversion result is available. “ready0” remains active until the multiplier is cleared. This signal is used to confirm the readiness of multiple modules to initiate another module’s operation. The testbench is listed below.
Preprints 83688 i003
Figure B1 shows the simulation waveform generated with ModelSim.
Figure B1. Waveform of modular inversion that calculates c = b a 1 mod m . The Verilog HDL code is given in Appendix B.
Figure B1. Waveform of modular inversion that calculates c = b a 1 mod m . The Verilog HDL code is given in Appendix B.
Preprints 83688 g0b1

References

  1. Koblitz, N. Elliptic curve cryptosystems. Mathematics of Computation 1987, 48, 203–209. https://www.ams.org/journals/mcom/1987-48-177/S0025-5718-1987-0866109-5/S0025-5718-1987-0866109-5.pdf.
  2. Miller, V.S. Use of Elliptic Curves in Cryptography. In Proceedings of the Advances in Cryptology — CRYPTO ’85 Proceedings, Berlin, Heidelberg, 1986; pp. 417–426. https://link.springer.com/content/pdf/10.1007/3-540-39799-X_31.pdf?pdf=inline%20link.
  3. Hankerson, D.; Menezes, A.; Vanstone, S. Guide to Elliptic Curve Cryptography; Springer-Verlag: New York Inc, 2004. [CrossRef]
  4. Certicom Corp. Standards for Efficient Cryptography. SEC 2: Recommended Elliptic Curve Domain Parameters; http://www.secg.org/sec2-v2.pdf, 2010.
  5. Barker, E.; Chen, L.; Roginsky, A.; Vassilev, A.; Davis, R. SP 800-56A Rev. 3, Recommendation for Pair-Wise Key-Establishment Schemes Using Discrete Logarithm Cryptography; National Institute of Standards and Technology, 2018. https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-56Ar3.pdf.
  6. Blakely, G.R. A Computer Algorithm for Calculating the Product AB Modulo M. IEEE Transactions on Computers 1983, C-32, 497–500. [CrossRef]
  7. Montgomery, P.L. Modular Multiplication Without Trial Division. Mathematics of Computation 1985, 44, 519–521. https://www.ams.org/journals/mcom/1985-44-170/S0025-5718-1985-0777282-X/S0025-5718-1985-0777282-X.pdf.
  8. Li, Y.; Chu, W. Shift-Sub Modular Multiplication Algorithm and Hardware Implementation for RSA Cryptography. In Proceedings of the 17th International Conference on Information Assurance and Security, Lecture Notes in Networks and Systems, Cham, 2021; pp. 541–552. [CrossRef]
  9. Li, Y.; Chu, W. Verilog HDL Implementation for an RSA Cryptography using Shift-Sub Modular Multiplication Algorithm. Journal of Information Assurance and Security 2022, 17, 113–121. http://www.mirlabs.org/jias/secured/Volume17-Issue3/Paper11.pdf.
  10. Bunimov, V.; Schimmler, M. Area and time efficient modular multiplication of large integers. In Proceedings of the Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors, 2003, pp. 400–409. [CrossRef]
  11. Chen, L.; Moody, D.; Regenscheid, A.; Robinson, A.; Randall, K. Recommendations for Discrete Logarithm-based Cryptography: Elliptic Curve Domain Parameters. NIST Special Publication NIST SP 800-186, 2023. https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-186.pdf. [CrossRef]
Figure 1. ECDH and ECC algorithms. The ECDH Key Exchange invokes a Scalar Point Multiplication that uses two computations - Point Addition and Point Doubling. Four primitive modular calculations (Addition, Subtraction, Multiplication, and Inversion) are used for these two computations.
Figure 1. ECDH and ECC algorithms. The ECDH Key Exchange invokes a Scalar Point Multiplication that uses two computations - Point Addition and Point Doubling. Four primitive modular calculations (Addition, Subtraction, Multiplication, and Inversion) are used for these two computations.
Preprints 83688 g001
Figure 2. Point addition R = P + Q on an elliptic curve y 2 = x 3 + a x + b in the real number field.
Figure 2. Point addition R = P + Q on an elliptic curve y 2 = x 3 + a x + b in the real number field.
Preprints 83688 g002
Figure 3. Point doubling R = 2 P on an elliptic curve y 2 = x 3 + a x + b in the real number field.
Figure 3. Point doubling R = 2 P on an elliptic curve y 2 = x 3 + a x + b in the real number field.
Preprints 83688 g003
Figure 4. Block diagram of interleaved modular multiplication (IMM). It implements Algorithm 6.
Figure 4. Block diagram of interleaved modular multiplication (IMM). It implements Algorithm 6.
Preprints 83688 g004
Figure 5. Block diagram of Montgomery modular multiplication (MMM). It implements Algorithm 7.
Figure 5. Block diagram of Montgomery modular multiplication (MMM). It implements Algorithm 7.
Preprints 83688 g005
Figure 6. Block diagram of shift-sub modular multiplication (SSMM). It implements Algorithm 8.
Figure 6. Block diagram of shift-sub modular multiplication (SSMM). It implements Algorithm 8.
Preprints 83688 g006
Figure 7. Block diagram of shift-sub modular multiplication with advance preparation (SSMMPRE). It implements Algorithm 9.
Figure 7. Block diagram of shift-sub modular multiplication with advance preparation (SSMMPRE). It implements Algorithm 9.
Preprints 83688 g007
Figure 8. Block diagram of shift-sub modular multiplication with CSAs and sign detection (SSMMCSA). It implements Algorithm 10.
Figure 8. Block diagram of shift-sub modular multiplication with CSAs and sign detection (SSMMCSA). It implements Algorithm 10.
Preprints 83688 g008
Figure 9. Determining the sign of P = C + S based on the 17-bit outputs C and S of CSAs. We divide 17 bits to four windows. There are five bits in the left-most window. Each of other three windows has four bits. The C and S are inputs; others are outputs of adders. W 3 [ 4 ] indicates the sign except for Case 0, Case 1, or Case 2, where the sign is the inverse of W 3 [ 4 ] when W 3 [ 3 : 0 ] = = 4 ’b1111.
Figure 9. Determining the sign of P = C + S based on the 17-bit outputs C and S of CSAs. We divide 17 bits to four windows. There are five bits in the left-most window. Each of other three windows has four bits. The C and S are inputs; others are outputs of adders. W 3 [ 4 ] indicates the sign except for Case 0, Case 1, or Case 2, where the sign is the inverse of W 3 [ 4 ] when W 3 [ 3 : 0 ] = = 4 ’b1111.
Preprints 83688 g009
Figure 10. Cost and performance of sign detection with different configurations.
Figure 10. Cost and performance of sign detection with different configurations.
Preprints 83688 g010
Figure 11. Cost performance comparison of modular multiplication algorithms.
Figure 11. Cost performance comparison of modular multiplication algorithms.
Preprints 83688 g011
Figure 12. Cost performance comparison of ECC in affine coordinates.
Figure 12. Cost performance comparison of ECC in affine coordinates.
Preprints 83688 g012
Figure 13. Cost performance comparison of ECC in projective coordinates.
Figure 13. Cost performance comparison of ECC in projective coordinates.
Preprints 83688 g013
Figure 14. Cost performance comparison of ECC in Jacobian coordinates.
Figure 14. Cost performance comparison of ECC in Jacobian coordinates.
Preprints 83688 g014
Figure 15. Relative latency of ECC. The SSMM in projective coordinates has the lowest latency.
Figure 15. Relative latency of ECC. The SSMM in projective coordinates has the lowest latency.
Preprints 83688 g015
Table 1. Elliptic curve Diffie-Hellman key exchange.
Table 1. Elliptic curve Diffie-Hellman key exchange.
Expose an elliptic curve y 2 = x 3 + a x + b mod m and a point P on the elliptic curve to the world
Alice Bob
Generate a secret d a Generate a secret d b
Calculate Q a = d a P (Algorithm 5) Calculate Q b = d b P (Algorithm 5)
Expose Q a Expose Q b
Get Q b from Bob Get Q a from Alice
Calculate Q a b = d a Q b (Algorithm 5) Calculate Q b a = d b Q a (Algorithm 5)
Use x of Q a b as the key Use x of Q b a as the key
Table 2. Cost and performance of sign detection with different configurations. The FPGA chip device is Cyclone V / 5CGXFC7D7F31C8.
Table 2. Cost and performance of sign detection with different configurations. The FPGA chip device is Cyclone V / 5CGXFC7D7F31C8.
Windows × Adder bits 2 × 128 4 × 64 8 × 32 16 × 16 32 × 8 64 × 4 128 × 2
ALMs 129 63 33 19 9 7 3
Frequency (MHz) 205.30 248.57 282.09 326.90 342.58 392.46 355.75
Table 3. Comparison of modular multiplication algorithms. SSMM is 1.76 times faster than IMM and SSMMCSA is 3.21 times faster than IMM. The FPGA chip device is Cyclone V / 5CGXFC7D7F31C8.
Table 3. Comparison of modular multiplication algorithms. SSMM is 1.76 times faster than IMM and SSMMCSA is 3.21 times faster than IMM. The FPGA chip device is Cyclone V / 5CGXFC7D7F31C8.
Algorithm Clock cycles Frequency (MHz) Latency ( μ s) ALMs Registers
IMM 258   35.15  7.34   657    268 
MMM 258   59.12  4.36   756    276 
SSMM 258   61.92  4.17   606    527 
SSMMPRE 258   66.16  3.90   847  1043 
SSMMCSA 259 113.29  2.29 1119  1043 
Table 4. Comparison of ECC in different coordinates with different modular multiplication algorithms. The FPGA chip device is Cyclone V / 5CGXFC7D7F31C8.
Table 4. Comparison of ECC in different coordinates with different modular multiplication algorithms. The FPGA chip device is Cyclone V / 5CGXFC7D7F31C8.
Algorithm Clock cycles Frequency (MHz) Latency (ms) ALMs Registers
ECC in Affine Coordinates
IMM 402,146 13.98 28.77 14,816   6,995
SSMM 402,146 20.41 19.70 15,032   8,085
SSMMPRE 402,146 20.69 19.44 16,723 12,743
SSMMCSA 403,166 20.40 19.76 18,416 12,834
ECC in Projective Coordinates
IMM 396,550 17.28 22.95 30,291 15,352
SSMM 396,550 29.49 13.45 29,878 23,434
SSMMPRE 396,550 28.54 13.89 37,545 41,902
SSMMCSA 398,080 21.31 18.68 46,831 47,984
ECC in Jacobian Coordinates
IMM 369,079 14.31 25.79 27,884 14,068
SSMM 369,079 21.15 17.45 28,978 21,708
SSMMPRE 369,079 20.95 17.62 35,864 39,024
SSMMCSA 370,502 17.14 21.62 44,591 45,298
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Prerpints.org logo

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

Subscribe

© 2024 MDPI (Basel, Switzerland) unless otherwise stated