Hardware Implementations of Elliptic Curve Cryptography Using Shift-Sub Based Modular Multiplication Algorithms

Preprint

Article

Hardware Implementations of Elliptic Curve Cryptography Using Shift-Sub Based Modular Multiplication Algorithms

Altmetrics

Downloads

310

Views

144

Comments

A peer-reviewed article of this preprint also exists.

Yamin Li^*

This version is not peer-reviewed

Submitted:

30 August 2023

Posted:

31 August 2023

You are already at the latest version

Alerts

Abstract

Elliptic curve cryptography (ECC) over prime fields relies on scalar point multiplication realized by point addition and point doubling. Point addition and point doubling operations consist of many modular multiplications of large operands (256 bits for example), especially in projective and Jacobian coordinates which eliminate the modular inversion required in affine coordinates for every point addition or point doubling operation. Accelerating modular multiplication is therefore important for high-performance ECC. This paper presents the hardware implementations of modular multiplication algorithms, including 1) Interleaved modular multiplication (IMM), 2) Montgomery modular multiplication (MMM), 3) Shift-sub modular multiplication (SSMM), 4) SSMM with advance preparation (SSMMPRE), and 5) SSMM with CSAs and sign detection (SSMMCSA) algorithms, and evaluates their execution time (the number of clock cycles and clock frequency) and required hardware resources (ALMs and registers). Experimental results show that SSMM is 1.76 times faster than IMM, and SSMMCSA is 3.21 times faster than IMM. We also present the ECC hardware implementations based on the Secp256k1 protocol in affine, projective, and Jacobian coordinates using the IMM, SSMM, SSMMPRE, and SSMMCSA algorithms, and investigate their cost and performance. Our ECC implementations can be applied to the design of hardware security module systems.

Keywords:

Subject: Computer Science and Mathematics - Security Systems

1. Introduction

The use of elliptic curves in cryptography is proposed by Neal Koblitz and Victor S. Miller independently in 1985 [1,2]. Elliptic curve cryptography (ECC) over finite field of a prime number m relies on the fact that scalar point multiplication

Q = d P

can be computed, but it is almost impossible to compute the multiplicand d given only the original point P and the point of the product Q. Scalar point multiplication can be done with point addition (adding two points) and point doubling [3]. In conventional affine coordinates, point addition and point doubling require computing the slope of a line involving division. In both projective and Jacobian coordinates, divisor multiplication is calculated such that division is eliminated during scalar point multiplication. One final division is required to convert a point from projective or Jacobian coordinates to affine coordinates to obtain the shared secret key based on the elliptic curve Diffie-Hellman (ECDH) key exchange. ECDH is a variant of the Diffie-Hellman key agreement protocol using ECC between two parties to establish a shared secret key over an insecure network [4,5]. A modular inversion algorithm calculating

a^{- 1} \mod m

is given in [3]. Based on it, this paper provides a Verilog HDL implementation to calculate

b a^{- 1} \mod m

. It can be used in affine coordinates to calculate the line slope, or in projective and Jacobian coordinates to convert points to affine coordinates in the final step for obtaining the shared secret key.

Point addition and point doubling operations consist of many modular multiplications, especially in projective and Jacobian coordinates. Accelerating modular multiplication is therefore important for high-performance ECC. Interleaved modular multiplication (IMM) algorithm [6] eliminates the need for division and has lower hardware cost. Montgomery modular multiplication (MMM) algorithm [7] performs multiplication in Montgomery domain and is very efficient for modular exponentiation used by RSA cryptography. Shift-sub modular multiplication (SSMM) algorithm [8,9] performs modular multiplication with shift and subtraction. An algorithm using CSA (Carry save adder) [10] is proposed but it requires either a modular computation after the iterations or a pre-computed look-up table for fixed multiplier and moduli (Y and M in the algorithm). This paper proposes two enhanced versions of SSMM: SSMM with advance preparation (SSMMPRE) and SSMM with CSAs and sign detection (SSMMCSA) algorithms. This paper also presents the ECC implementations based on Secp256k1 [4] protocol which is used in Bitcoin where a 256-bit

m = 2^{256} - 2^{32} - 2^{9} - 2^{8} - 2^{7} - 2^{6} - 2^{4} - 1

is used as the modulus.

The new contributions of this paper are summarized below. 1) Based on SSMM, we propose SSMMPRE (SSMM with advance preparation) and SSMMCSA (SSMM with CSAs and sign detection) algorithms that have lower latency than SSMM. 2) The hardware implementations of IMM, MMM, SSMM, SSMMPRE, and SSMMCSA algorithms are presented and their cost (required adaptive logic modules (ALMs) and registers) and performance (the number of clock cycles and clock frequency) are evaluated. 3) The hardware implementations of ECC in affine, projective, and Jacobian coordinates using the IMM, SSMM, SSMMPRE, and SSMMCSA algorithms are presented and their cost and performance are evaluated. And 4) some important Verilog HDL source codes and their simulation waveform are included. Compared to RSA cryptography, ECC provides stronger encryption with shorter key lengths. The ECC implementations described in this paper can be applied to the design of HSM (Hardware security module) systems.

The rest of the paper is organized as follows. Section 2 introduces the background of ECC, including the point addition, point doubling, scalar point multiplication, ECDH key agreement protocol, and affine, projective, and Jacobian coordinates. Section 3 describes the modular multiplication algorithms, including the IMM, MMM, SSMM, SSMMPRE, and SSMMCSA algorithms. Section 4 presents hardware implementations of modular multiplications and ECC and evaluates their cost and performance. And Section 5 concludes the paper. Verilog HDL codes for SSMM and modular inversion are listed in appendixes.

2. Elliptic Curve Cryptography Algorithms

This section describes ECC algorithms and ECDH that underlie this work. A top view of the relationship between these algorithms is shown in Figure 1.

The interesting computations in ECC are point addition and point doubling. We first introduce these two computations in an elliptic curve over the real numbers. In the real number field, an elliptic curve can be defined in Weierstrass form as

y^{2} = x^{3} + a x + b

(1)

Note that this curve is symmetrical about the x-axis. If point

P = [x_{p}, y_{p}]

is on an elliptic curve, then

- P = [x_{p}, - y_{p}]

is also on the same elliptic curve.

2.1. ECC Point Addition and Doubling in Affine Coordinates

Affine coordinates use two coordinates

[x, y]

to represent an elliptic curve point, as shown by, for example,

P = [x_{p}, y_{p}]

and

- P = [x_{p}, - y_{p}]

. We will see that the ECC point addition and point doubling in affine coordinates require expensive divisions.

2.1.1. ECC Point Addition in Affine Coordinates

Figure 2 shows the point addition

R = P + Q

on an elliptic curve

y^{2} = x^{3} + a x + b

. Given two distinct points

P = [x_{p}, y_{p}]

and

Q = [x_{q}, y_{q}]

on the curve, if the line

L_{1}

through P and Q intersects the curve in

S = [x_{s}, y_{s}]

, then

R = [x_{r}, y_{r}] = P + Q

is defined as

x_{r} = x_{s}

and

y_{r} = - y_{s}

. Because

x_{r} = x_{s}

, the line

L_{2}

through S and R is a vertical line.

Below we show how to get formulas to calculate

x_{r}

and

y_{r}

based on

x_{p}

y_{p}

x_{q}

, and

y_{q}

for

y^{2} = x^{3} + a x + b

. The formula of the line

L_{1}

through P and Q is

y = λ (x - x_{p}) + y_{p}

(2)

where

λ

is the slope of the line

L_{1}

. Squaring both the left side and right side of (2), we get

y^{2} = {(λ (x - x_{p}) + y_{p})}^{2}

. Then replacing

y^{2}

of (1) with

{(λ (x - x_{p}) + y_{p})}^{2}

, we have

x^{3} - λ^{2} x^{2} + (a - 2 y_{p} λ + 2 x_{p} λ^{2}) x + (b - {(y_{p} - λ x_{p})}^{2}) = 0

(3)

Because P, Q, and S are three points on the curve, meaning that

x_{p}

x_{q}

, and

x_{s}

are three roots of (3), based on Vieta’s formulas, we have

(x - x_{p}) (x - x_{q}) (x - x_{s}) = 0

(4)

Expanding (4) gives:

x^{3} - (x_{p} + x_{q} + x_{s}) x^{2} + (x_{p} x_{q} + x_{q} x_{s} + x_{s} x_{p}) x - x_{p} x_{q} x_{s} = 0

(5)

Then from (3) and (5) we have

x_{p} + x_{q} + x_{s} = λ^{2}

. That is,

x_{s} = λ^{2} - x_{p} - x_{q}

and

y_{s} = λ (x_{s} - x_{p}) + y_{p}

. Considering the

L_{1}

line slope

λ = (y_{q} - y_{p}) / (x_{q} - x_{p})

x_{r} = x_{s}

, and

y_{r} = - y_{s}

, we summarize the formulas for point addition

R = P + Q = (x_{r}, y_{r})

on elliptic curve

y^{2} = x^{3} + a x + b

as follows.

λ = \frac{y_{q} - y_{p}}{x_{q} - x_{p}}

(6)

x_{r} = λ^{2} - x_{p} - x_{q}

(7)

y_{r} = λ (x_{p} - x_{r}) - y_{p}

(8)

For

Q = - P

, the line through P and

- P

does not intersect the elliptic curve at the third point. For this reason, the point O at infinity is included in the group of elliptic curves and defined as

P + (- P) = O

. By this definition,

P + O = P

In practice, a group of elliptic curves over a finite field of

F_{m}

F_{2^{n}}

is used, where

F_{m}

contains numbers from 0 to

m - 1

and

F_{2^{n}}

uses n-bit binary numbers. In the case of

F_{m}

, the results of all the above calculations are modularized by m, where m is usually a prime number. For example, Secp256k1 [4] elliptic curve used in Bitcoin uses a 256-bit

m = 2^{256} - 2^{32} - 2^{9} - 2^{8} - 2^{7} - 2^{6} - 2^{4} - 1

. Secp256k1 defines

y^{2} = x^{3} + a x + b = x^{3} + 7

and gives a point

P = [x, y]

on the elliptic curve as follows.

a = 0x0000000000000000000000000000000000000000000000000000000000000000

b = 0x0000000000000000000000000000000000000000000000000000000000000007

m = 0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffefffffc2f

x = 0x79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798

y = 0x483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8

By considering

P + O = P

, we give the point addition

R = P + Q

algorithm over the finite field of

F_{m}

in Algorithm 1. In our implementation, O is denoted as

[- 1, - 1]

. In the case of

Q = P

, we perform the point doubling

R = 2 P

(line 6 in the algorithm).

An example of point addition

R = P + Q

on the Secp256k1 curve is shown below where

[P_{x}, P_{y}] = P

[Q_{x}, Q_{y}] = Q

, and

[R_{x}, R_{y}] = R

in affine coordinates.

Px = 0xe493dbf1c10d80f3581e4904930b1404cc6c13900ee0758474fa94abe8c4cd13

Py = 0x51ed993ea0d455b75642e2098ea51448d967ae33bfbdfe40cfe97bdc47739922

Qx = 0x79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798

Qy = 0x483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8

Rx = 0x2f8bde4d1a07209355b4a7250a5c5128e88b84bddc619ab7cba8d569b240efe4

Ry = 0xd8ac222636e5e3d6d4dba9dda6c9c426f788271bab0d6840dca87d3aa6ac62d6

2.1.2. ECC Point Doubling in Affine Coordinates

Figure 3 shows the point doubling

R = 2 P

on an elliptic curve

y^{2} = x^{3} + a x + b

. Compared to the point addition shown in Figure 2, here we have

Q = P

and

R = P + Q = 2 P

. Given a point

P = [x_{p}, y_{p}]

on the curve, if the tangent line

L_{1}

through P intersects the curve in

S = [x_{s}, y_{s}]

, then

R = [x_{r}, y_{r}] = 2 P

is defined as

x_{r} = x_{s}

and

y_{r} = - y_{s}

Below we show how to get formulas to calculate

x_{r}

and

y_{r}

based on

x_{p}

y_{p}

, and a for

y^{2} = x^{3} + a x + b

. The formula of the line

L_{1}

through P is

y = λ (x - x_{p}) + y_{p}

(9)

where

λ

is the slope of the line

L_{1}

. Squaring both the left side and right side of (9), we get

y^{2} = {(λ (x - x_{p}) + y_{p})}^{2}

. Then replacing

y^{2}

of (1) with

{(λ (x - x_{p}) + y_{p})}^{2}

, we have

x^{3} - λ^{2} x^{2} + (a - 2 y_{p} λ + 2 x_{p} λ^{2}) x + (b - {(y_{p} - λ x_{p})}^{2}) = 0

(10)

Because P, Q (

= P

), and S are three points on the curve, meaning that

x_{p}

x_{q}

, and

x_{s}

are three roots of (10), based on Vieta’s formulas, we have

{(x - x_{p})}^{2} (x - x_{s}) = 0

(11)

Expanding (11) gives:

x^{3} - (2 x_{p} + x_{s}) x^{2} + (x_{p}^{2} + 2 x_{p} x_{s}) x - x_{p}^{2} x_{s} = 0

(12)

Then from (10) and (12) we have

2 x_{p} + x_{s} = λ^{2}

. That is,

x_{s} = λ^{2} - 2 x_{p}

and

y_{s} = λ (x_{s} - x_{p}) + y_{p}

. The slope of tangent line

L_{1}

y^{2} = x^{3} + a x + b

at P can be obtained as follows.

\frac{d}{d x} (y^{2}) = \frac{d}{d x} (x^{3} + a x + b)

(13)

2 y \frac{d y}{d x} = 3 x^{2} + a

(14)

\frac{d y}{d x} = (3 x^{2} + a) / (2 y)

(15)

λ = (3 x_{p}^{2} + a) / (2 y_{p}) a t P

(16)

Considering

x_{r} = x_{s}

and

y_{r} = - y_{s}

, we summarize the formulas for point doubling

R = 2 P

on elliptic curve

y^{2} = x^{3} + a x + b

as follows.

λ = \frac{3 x_{p}^{2} + a}{2 y_{p}}

(17)

x_{r} = λ^{2} - 2 x_{p}

(18)

y_{r} = λ (x_{p} - x_{r}) - y_{p}

(19)

We give the point doubling

R = 2 P

algorithm over the finite field of

F_{m}

in Algorithm 2. For

P_{y} = 0

, the tangent at P is vertical and does not intersect the elliptic curve at any other point. By definition,

2 P = O

for such a point P (line 2 in the algorithm).

An example of point doubling

R = 2 P

on the Secp256k1 curve is shown below where

[P_{x}, P_{y}] = P

, and

[R_{x}, R_{y}] = R

in affine coordinates.

Px = 0xb91dc87409c8a6b81e8d1be7f5fc86015cfa42f717d31a27d466bd042e29828d

Py = 0xc35b462fb20bec262308f9d785877752e63d5a68e563e898b4f82f47594680fc

Rx = 0x2d4fca9e0dff8dec3476a677d555896a0980ebccc6bc595a23675496dcc33bb5

Ry = 0xcce413eee9496094256e446b22fd234c03d9258330d77fc8b0d318a6aedba8cb

2.2. ECC Point Addition and Doubling in Projective Coordinates

Point addition and point doubling in affine coordinates require modular inversion (division) to calculate the line slope

λ

. We can eliminate the expensive modular inversion during calculations by using projective coordinates. In projective coordinates, a point is defined as

P = [X, Y, Z]

. Initially, we can convert a point

[X, Y]

in affine coordinates to a point in projective coordinates by

P = [X, Y, 1]

. Then we calculate

R = [X_{r}, Y_{r}, Z_{r}]

in projective coordinates using formulas that do not contain division. At the very final step, we can get the point

[x_{r}, y_{r}]

in affine coordinates with the transformation of

x_{r} = X_{r} / Z_{r}

and

y_{r} = Y_{r} / Z_{r}

, which require divisions.

The formulas for ECC point addition and doubling in projective coordinates can be derived based on the formulas in affine coordinates. A point P in projective coordinates is represented by the triple

P = [X, Y, Z]

, corresponding to the point

[x_{p}, y_{p}] = [X / Z, Y / Z]

in affine coordinates. That is,

x_{p} = X / Z

and

y_{p} = Y / Z

. We derive the point doubling formulas for

R = [X_{r}, Y_{r}, Z_{r}] = 2 P

in projective coordinates as follows.

From (17), we have

λ = \frac{3 x_{p}^{2} + a}{2 y_{p}} = \frac{3 {(X / Z)}^{2} + a}{2 (Y / Z)} = \frac{3 X^{2} / Z^{2} + a}{2 Y / Z} = \frac{3 X^{2} + a Z^{2}}{2 Y Z}

From (18), we have

x_{r} = λ^{2} - 2 x_{p} = {(\frac{3 X^{2} + a Z^{2}}{2 Y Z})}^{2} - 2 X / Z = \frac{{(3 X^{2} + a Z^{2})}^{2}}{{(2 Y Z)}^{2}} - \frac{8 X Y^{2} Z}{{(2 Y Z)}^{2}}

Let

Z_{r} = {(2 Y Z)}^{3}

. Because

x_{r} = X_{r} / Z_{r} = X_{r} / {(2 Y Z)}^{3}

, i.e.,

X_{r} = {(2 Y Z)}^{3} x_{r}

, then

X_{r} = 2 Y Z ({(3 X^{2} + a Z^{2})}^{2} - 8 X Y^{2} Z)

From (19), we have

y_{r} = λ (x_{p} - x_{r}) - y_{p} = \frac{3 X^{2} + a Z^{2}}{2 Y Z} (X / Z - \frac{{(3 X^{2} + a Z^{2})}^{2} - 8 X Y^{2} Z}{{(2 Y Z)}^{2}}) - Y / Z

= \frac{(3 X^{2} + a Z^{2}) (4 X Y^{2} Z - ({(3 X^{2} + a Z^{2})}^{2} - 8 X Y^{2} Z))}{{(2 Y Z)}^{3}} - \frac{8 Y^{4} Z^{2}}{{(2 Y Z)}^{3}}

Because

y_{r} = Y_{r} / Z_{r} = Y_{r} / {(2 Y Z)}^{3}

, i.e.,

Y_{r} = {(2 Y Z)}^{3} y_{r}

, then

Y_{r} = (3 X^{2} + a Z^{2}) (4 X Y^{2} Z - ({(3 X^{2} + a Z^{2})}^{2} - 8 X Y^{2} Z)) - 8 Y^{4} Z^{2}

Given

P = [X, Y, Z]

, the formulas for point doubling

R = 2 P

in projective coordinates are summarized below.

X_{r} = 2 Y Z ({(3 X^{2} + a Z^{2})}^{2} - 8 X Y^{2} Z)

(20)

Y_{r} = (3 X^{2} + a Z^{2}) (4 X Y^{2} Z - ({(3 X^{2} + a Z^{2})}^{2} - 8 X Y^{2} Z)) - 8 Y^{4} Z^{2}

(21)

Z_{r} = {(2 Y Z)}^{3}

(22)

We can use the similar method to derive the formulas for point addition in projective coordinates. The derivation is omitted here but the calculations are shown in the following algorithm. It can be seen that the calculations require many more multiplications. Algorithm 3 gives the algorithm for point addition in projective coordinates. And Algorithm 4 gives the algorithm for point doubling in projective coordinates.

An example of point addition

R = P + Q

on the Secp256k1 curve is shown below where

[P_{x}, P_{y}, P_{z}] = P

[Q_{x}, Q_{y}, Q_{z}] = Q

, and

[R_{x}, R_{y}, R_{z}] = R

in projective coordinates.

Px = 0x61bac660b055382e5906bd6e56e316542194b799b7bcf5ad05ee2171fd81735a

Py = 0xbe44ac0a2b712ccb6bb3ea933e4db0a4213c139078aef594cf8c2c5c2924d54d

Pz = 0x2b6da6fb02877584dc4d5111c88783772d7be5ac2866cce3707d53913384bf49

Qx = 0x6789c1137724f1f3f585337a1814eebc23ea329a0390fd9b1b9ece7af3e71ce1

Qy = 0xc9563c5035ccafec8673f56185141f720073ab3063bb417bf0e70e9d9128c232

Qz = 0x58990cd022b711912676c0451bdab6be04a06c1871b0139214bdbe81fd965555

Rx = 0xf4e9cb9ba9c18876b7b0ad000ce921b35e23139456f4f6c3f70e2fea149500a0

Ry = 0xc06176a9221b6d8b49a22130fb934b21358a1775df68d93ec308aca3ece072b5

Rz = 0x878fb153f0690416ba0ee136ec663debf8472f3ee92d350f9b3a42b4fd53fb27

An example of point doubling

R = 2 P

on the Secp256k1 curve is shown below where

[P_{x}, P_{y}, P_{z}] = P

and

[R_{x}, R_{y}, R_{z}] = R

in projective coordinates.

Px = 0x24fd537e9a5125438a02848f6b74725f678723f5c1450b8fb82a68f0c88c9764

Py = 0xe42e83a1d3d7c2241535b5c0ba5f2462c24bd87aaf9f15b05f3775d168b9bf6c

Pz = 0xe00794d20b32e0e94472c36b89cf5e5d6ec769b53dd6c1422e9467090c272305

Rx = 0x24d00c48ac8bbe61ceb0ac5daf5defd913af9220a07650642a3a41cad9030ee6

Ry = 0x75d429714ea6ce1ab3811d9adc16961a219e2812210fa8465042c18ecd5a0de6

Rz = 0x644a5a2964435364b74d7fa79fe0f06a5b1d2782e7f7b8d1e835db6d6b8786bc

2.3. ECC Point Addition and Doubling in Jacobian Coordinates

A point P in Jacobian coordinates is represented by the triple

P = [X, Y, Z]

, corresponding to the point

[x_{p}, y_{p}] = [X / Z^{2}, Y / Z^{3}]

[2] (Page 424) in affine coordinates. That is,

x_{p} = X / Z^{2}

and

y_{p} = Y / Z^{3}

. We derive the point doubling formulas for

R = [X_{r}, Y_{r}, Z_{r}] = 2 P

in Jacobian coordinates as follows.

From (17), we have

λ = \frac{3 x_{p}^{2} + a}{2 y_{p}} = \frac{3 {(X / Z^{2})}^{2} + a}{2 (Y / Z^{3})} = \frac{3 X^{2} / Z^{4} + a}{2 Y / Z^{3}} = \frac{3 X^{2} + a Z^{4}}{2 Y Z}

From (18), we have

x_{r} = λ^{2} - 2 x_{p} = {(\frac{3 X^{2} + a Z^{4}}{2 Y Z})}^{2} - 2 X / Z^{2} = \frac{{(3 X^{2} + a Z^{4})}^{2}}{{(2 Y Z)}^{2}} - \frac{8 X Y^{2}}{{(2 Y Z)}^{2}}

Let

Z_{r} = 2 Y Z

. Because

x_{r} = X_{r} / Z_{r}^{2} = X_{r} / {(2 Y Z)}^{2}

, we have

X_{r} = {(3 X^{2} + a Z^{4})}^{2} - 8 X Y^{2}

From (19), we have

y_{r} = λ (x_{p} - x_{r}) - y_{p} = \frac{3 X^{2} + a Z^{4}}{2 Y Z} (X / Z^{2} - X_{r} / {(2 Y Z)}^{2}) - Y / Z^{3} =

\frac{(3 X^{2} + a Z^{4}) (4 X Y^{2} - X_{r})}{{(2 Y Z)}^{3}} - \frac{8 Y^{4}}{{(2 Y Z)}^{3}}

Because

y_{r} = Y_{r} / Z_{r}^{3} = Y_{r} / {(2 Y Z)}^{3}

, we have

Y_{r} = (3 X^{2} + a Z^{4}) (4 X Y^{2} - X_{r}) - 8 Y^{4}

Given

P = [X, Y, Z]

, the formulas for point doubling

R = 2 P

in Jacobian coordinates are summarized below.

X_{r} = {(3 X^{2} + a Z^{4})}^{2} - 8 X Y^{2}

(23)

Y_{r} = (3 X^{2} + a Z^{4}) (4 X Y^{2} - X_{r}) - 8 Y^{4}

(24)

Z_{r} = 2 Y Z

(25)

An example of point doubling

R = 2 P

on the Secp256k1 curve is shown below where

[P_{x}, P_{y}, P_{z}] = P

and

[R_{x}, R_{y}, R_{z}] = R

in Jacobian coordinates.

Px = 0xe43306185ef298127aef469d577aed78acafaddfc28ad0857491c38ffbedc475

Py = 0x4d83871239769596f65c180546c170a28cffca37bf6393025c457f406f54c517

Pz = 0xdafa620812722dceda7d93a91158dadbe11fee894e71eafa054d5f5fd274377e

Rx = 0x7d7cd6974d7e127a5fdf3f3c9c9eb5dcd9c15e033794466de63bcf2b9548ff85

Ry = 0x8f967b514296945a6dd052bca59ec1418a35cde3c6dd7b269d2e71daa80f851e

Rz = 0xcf89daf8b6c736cc851882b7e85c8ea8f703a9323a3d627909582b7904766035

Based on formulas (6), (7), and (8), and

x = X / Z^{2}

and

y = Y / Z^{3}

, the formulas for point addition in Jacobian coordinates can be derived which are omitted here. An example of point addition

R = P + Q

on the Secp256k1 curve is shown below where

[P_{x}, P_{y}, P_{z}] = P

[Q_{x}, Q_{y}, Q_{z}] = Q

, and

[R_{x}, R_{y}, R_{z}] = R

in Jacobian coordinates.

Px = 0xb7bae589ec8a8c722c1ffb2c37fd4bbeda59074675c3eb50f1673ed46bbedfbe

Py = 0x81dee3398bdd718591c10762f61a0e41c4d609dffddcbeeb3894b8c4ce75e027

Pz = 0x51ec57b21350ad3d3466be5a7d28742279fbac1146fb4143767ee368a7dc741e

Qx = 0x0aa20a04dba4788e9b99e10f2e9f4d43b7f53916a5cacf2050dc70bc34c18d21

Qy = 0x60f318d01180b303f4b20a49c2b7e2b498405f88bd423a9a7cb92bab5f1b6abf

Qz = 0x3e1dcb6efa88b113f40b5858ea8c3cb5ddae2277c4683af9487e27023cba690d

Rx = 0x348248c47ad5d3186bd807c382659263840ba7ea13e61128d24337db9b0e5278

Ry = 0xb38573698dd6fef9ec93a9d68ae0a997191b678474cc00a13961defa3ed763e9

Rz = 0x76da2008046aae9901e6b96a7b54c42fd480de5cbc8cef6c0d0a9b3d7086f0f4

The point

[x_{r}, y_{r}]

in affine coordinates can be transformed from Jacobian coordinates by

x_{r} = X_{r} / Z_{r}^{2}

and

y_{r} = Y_{r} / Z_{r}^{3}

. This is only performed one time at the final step. It can be done with the modular inversion which we will introduce next.

2.4. Modular Inversion

In affine coordinates, the point addition and point doubling algorithms must compute the slope

λ

of a line. Projective and Jacobian coordinates require the point to be transformed into affine coordinates at the final step to obtain the shared secret key. These calculations or transformations require division and modulo operations.

Generally, the modular inversion calculates

c = b a^{- 1} \mod m

, where m is an n-bit odd number and

{a, b} < m

. An example of modular inversion is shown below.

b = 0x9cfa1c993911914be0f15bd74a878abe0079c6254b961b82e1abda76387d1d85

a = 0xd5076ae274e874c2eb0f7778717c39460236549ddd9fc651e68a0c0e787b4ce8

m = 0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffefffffc2f

c = 0xe8e5ac2e1d3358894ce1b3342737b38c39b89059dd55d3c4741626de8270228e

Algorithm 2.22 in [3] gives an algorithm to calculate

a^{- 1} \mod m

using the extended Euclidean algorithm. Based on it, we give a Verilog HDL implementation of the modular inversion that calculates

c = b a^{- 1} \mod m

in Appendix B. Figure B1 shows the simulation waveform generated with ModelSim.

In affine coordinates, every point addition or point doubling invokes modular inversion to calculate the line slope

λ

. In projective or Jacobian coordinates, this modular inversion can be removed during point addition or point doubling calculations, but a final division is required to transform a point from projective or Jacobian to affine coordinates to obtain the shared secret key. This division can be achieved using modular inversion.

2.5. Scalar Point Multiplication

Scalar point multiplication performs

Q = d P

where P and Q are elliptic curve points and

d = 〈 d_{n - 1} \dots d_{1} d_{0} 〉

is an n-bit scalar. Scalar point multiplication is also called point multiplication, it can be done with “double-and-add” method [3]. Algorithm 5 formally gives the algorithm for scalar point multiplication. The algorithm invokes point addition and point doubling. Point doubling can be calculated simultaneously with point addition.

Below we give a small example to show the calculation steps of the scalar point multiplication. For a 4-bit

d = 1101_{2} = 13

, we calculate

Q = d P

in 4 steps to get

Q = 13 P

	Weight	Point addition	Point doubling
Initial		$Q = O$	$R = P$
$d_{0} = 1$	1	$Q = Q + R = O + P = P$	$R = 2 R = 2 P$
$d_{1} = 0$	2		$R = 2 R = 4 P$
$d_{2} = 1$	4	$Q = Q + R = P + 4 P = 5 P$	$R = 2 R = 8 P$
$d_{3} = 1$	8	$Q = Q + R = 5 P + 8 P = 13 P$	$R = 2 R = 16 P$

2.6. Elliptic Curve Diffie-Hellman Key Exchange

Elliptic curve Diffie-Hellman (ECDH) is a variant of the Diffie-Hellman key agreement protocol using ECC between two parties to establish a shared secret key over an insecure network [4,5]. Then this shared secret key can be used by the two parties to encrypt and decrypt subsequent communications using a fast symmetric-key cryptography over the insecure network.

The ECDH key exchange protocol is shown in Table 1. Because

Q_{a b} = d_{a} Q_{b} = d_{a} d_{b} P

Q_{b a} = d_{b} Q_{a} = d_{b} d_{a} P

, and

d_{a} d_{b} = d_{b} d_{a}

, we have

Q_{b a} = Q_{a b}

. Below shows an ECDH key exchange example using Secp256k1. We can see that two parties have a same shared secret key (Qabx = Qbax).

da = 0x650aa7095daeaa37ab9051541f0ce304f8969a6d88bb3bebb4fe680fca9a2595

db = 0xedc68f194c4e30d6ef90467df822b00e5ef122dea48c9d1c54817080d1a341f4

Qax = 0x167d2537aa6bbd8d978b58be0f9466520b7b184e205ff96a9ff567b35b32c7b7

Qay = 0xde3961553d36551f92726fee0e332133960edddccd2784b98b2af730d2fc6e14

Qbx = 0x839da64a414c2243a5526230603109be9c615613a9e98c3d650bb0488580bbda

Qby = 0x96e88e99304a5afcdd77c4f3b3327a28162627ebe08194baa0c78dfb67a11042

Qabx = 0x1f254c7da15899275cdcab9d992f58251a4ab630fe9864d20cf317ab57749947

Qaby = 0xd6cb400b3c49d33d3df28f9d34fa09f8b6c8edf117a378c5a45d0a51e6c0debc

Qbax = 0x1f254c7da15899275cdcab9d992f58251a4ab630fe9864d20cf317ab57749947

Qbay = 0xd6cb400b3c49d33d3df28f9d34fa09f8b6c8edf117a378c5a45d0a51e6c0debc

3. Modular Multiplication Algorithms

Point addition and point doubling use many modular multiplications. This section describes interleaved modular multiplication (IMM), Montgomery modular multiplication (MMM), and shift-sub modular multiplication (SSMM) algorithms, and proposes shift-sub modular multiplication with advance preparation (SSMMPRE) and shift-sub modular multiplication with CSAs and sign detection (SSMMCSA) algorithms.

3.1. Interleaved Modular Multiplication Algorithm

The IMM algorithm [6] is formally given in Algorithm 6. It computes

p = a b \mod m

where

R = 2^{n}

a, b < m < R

, and m is an n-bit odd number. That is, the (

n - 1

)th bit and 0th bit of n-bit m are 1. IMM begins with checking the (

n - 1

)th bit of multiplier b. Therefore, in each iteration, the product p is shifted to the left by one bit (line 3). Because

a, b < m

2 p + a < 3 m

(lines 3 and 4). Therefore, it is enough to subtract

2 m

from

2 p + a

(lines 5 and 6), ensuring

p < m

Figure 4 shows a possible hardware implementation of Algorithm 6. Red squares are registers, others are combinational circuits. Clearly, the critical path is the right part that computes the new p in each iteration. It consists of three carry-propagate adders (CPAs) and three multiplexers. The most significant bit of the adder output (sign) can be used as the select signal of multiplexers. Note that

2 p

can be realized by suitable wiring.

3.2. Montgomery Modular Multiplication Algorithm

The MMM algorithm [7] performs modular multiplication in Montgomery domain. It is very efficient for modular exponentiation used by RSA cryptography. MMM calculates

p = a b R^{- 1} \mod m

, where

R = 2^{n}

a, b < m < R

, and m is an n-bit odd number with

m_{n - 1} = m_{0} = 1

. It performs reduction

R^{- 1}

during the multiplication

a b

because a and b are represented in Montgomery domain as follows.

a = a^{'} R \mod m

b = b^{'} R \mod m

where

a^{'}

and

b^{'}

are the operands in the conventional domain. Such a reduction ensures that the product p is an operand still in Montgomery domain:

p = a b R^{- 1} \mod m = a^{'} R b^{'} R R^{- 1} \mod m = a^{'} b^{'} R^{2} R^{- 1} \mod m = a^{'} b^{'} R \mod m

MMM is widely used in RSA cryptography where the modular exponentiation is realized with the repeated modular multiplications. MMM algorithm is formally given in Algorithm 7. We perform

R^{- 1} = 1 / 2^{n} = \prod_{i = 0}^{n - 1} 1 / 2

in n iterations and divide p by 2 in each iteration. Line 3 performs multiplication. Line 4 makes p even for the reduction where

p_{0}

is the least significant bit of p. Line 5 shifts p to right by one bit (

p / 2

). Line 6 ensures

p < m

. The reason is shown below. Suppose

p = a

after the 0th iteration. For

i \geq 1

p = p + a < 2 m

p = p + m < 3 m

p = p / 2 < 2 m

p - m < m

p > = m

Figure 5 shows a possible hardware implementation of Algorithm 7. Red squares are registers, others are combinational circuits. Clearly, the critical path is the right part that computes the new p in each iteration. It consists of three CPAs and three multiplexers. The most significant bit of the adder output (sign) or the least significant bit

p_{0}

of p can be used as the select signal of multiplexers. Note that

p / 2

can be realized by suitable wiring.

3.3. Shift-Sub Modular Multiplication Algorithm

The SSMM algorithm [8,9] is given in Algorithm 8. It begins with checking the 0th bit of multiplier b to calculate

p = a b \mod m

. Therefore, in each iteration, the multiplicand u (

= a

) is shifted to the left by one bit (line 5). Because

a, b < m

p + u < 2 m

(line 3). Therefore, it is enough to subtract m from

p + u

(line 4), ensuring

p < m

. Because

a, b < m

2 u < 2 m

(line 5). Therefore, it is enough to subtract m from

2 u

(line 6), ensuring

u < m

Figure 6 shows a possible hardware implementation of Algorithm 8. Red squares are registers, others are combinational circuits. Compared to Figure 4 and Figure 5, here we reduce critical path to two CPAs and two multiplexers. The bit

b_{i}

(b[cnt] in the figure) is used as a selection signal for the last (bottom) multiplexer. If it is a 0, the value in register p is selected (unchanged). The register u and its corresponding combinational circuits are added for performing lines 5 and 6 in Algorithm 8. Note that the implementation is slightly different from the algorithm. Regardless of

b_{i}

, we calculate

p + u

and

p + u - m

first. If

p + u - m

is non-negative, select it, otherwise select

p + u

. Finally, the value selected by

b_{i}

is written to register p.

Note that instead of using a multiplexer in the bottom, the bit

b_{i}

can be used as a write enable for register p. That is, if

b_{i} = 0

, we just keep the content of register p unchanged.

3.4. Shift-Sub Modular Multiplication with Advance Preparation Algorithm

From the implementation of the SSMM algorithm, we can see that the critical path is the computation of

p + u - m = p + (u - m)

. m is a modulus and does not change during the multiplication. u is the multiplicand and doubles with each iteration. Then we can prepare

u - m

in advance in the previous iteration. u is generated as follows. If

2 u^{'} - m

is non-negative,

2 u^{'} - m

is written to register u; otherwise,

2 u^{'}

is written to register u, where

u^{'}

is the value of u in the previous iteration. In the current iteration, we have two cases for

u - m

. Case 1:

u - m = 2 u^{'} - m

2 u^{'} - m

is negative. Case 2:

u - m = 2 u^{'} - m - m = 2 u^{'} - 2 m

2 u^{'} - m

is non-negative. Then we just prepare

x = 2 u^{'} - m

and

y = 2 u^{'} - 2 m

, and store them in registers. That is, if

x < 0

p + (u - m) = p + x

; otherwise

p + (u - m) = p + y

. The algorithm SSMMPRE is formally given in Algorithm 9.

Figure 7 shows the hardware implementation of Algorithm 9. Registers x and y hold

2 u^{'} - m

and

2 u^{'} - 2 m

, respectively, where

u^{'}

is the value of u in the previous iteration.

3.5. Shift-Sub Modular Multiplication with CSAs and Sign Detection Algorithm

The algorithm SSMMCSA (SSMM with CSAs and sign detection) is formally given in Algorithm 10. Figure 8 shows the implementation of SSMMCSA. The two CPAs in Figure 6 are replaced with CSAs. Through our hardware implementation, we find that the part of CSAs that generates c and s is no longer the critical path. We use another register q to store

c + s

calculated with a CPA, and generate p from q. The critical path of this circuit contains a CPA and a multiplexer. The output of CSAs consists of carry c and sum s. The result value p will be

(c + s) \mod m

. We will later propose an easy way to determine the sign of

c + s

from c and s without using CPA to calculate

c + s

. The upper CSA performs

(g, h) \leftarrow C S A (c, s, u)

and the other CSA performs

(x, y) \leftarrow C S A (g, h, - m)

, corresponding to

p + u

and

p + u - m

in Figure 6. One CSA’s output will be selected with the “csasign” (CSA’s sign) signal and stored to the CS register.

Note that we can get

- m

from m quickly. Usually

- m = \bar{m} + 1

where

+ 1

needs a CPA. But here m is an n-bit odd value (

m [0] = 1

), so that we can invert the left

n - 1

bits of m and leave the least-significant-bit unchanged, that is,

- m = {\bar{m [n - 1 : 1]}, 1

’b

1}

Now we describe how to generate the signal of “csasign”. Figure 9 shows an example of determining the sign of

C + S

from 17-bit C and S for a 16-bit

a b \mod m

when using CSAs. We divide the 17 bits to 4 windows, namely

W 3

W 2

W 1

, and

W 0

W 3

has 5 bits and each of

W 2

W 1

, and

W 0

has 4 bits. We use one 5-bit and three 4-bit CPAs, carry-lookahead adders (CLAs) for instance, to perform additions in 4 windows simultaneously as follows. Note that each of adders generates a 5-bit result.

$W 3 [4 : 0]$	=	$C [16 : 12]$	+	$S [16 : 12]$	(Add 5-bit, 5-bit sum)
$W 2 [4 : 0]$	=	$C [11 : 8]$	+	$S [11 : 8]$	(Add 4-bit, 5-bit sum)
$W 1 [4 : 0]$	=	$C [7 : 4]$	+	$S [7 : 4]$	(Add 4-bit, 5-bit sum)
$W 0 [4 : 0]$	=	$C [3 : 0]$	+	$S [3 : 0]$	(Add 4-bit, 5-bit sum)

We summarize the three cases shown in Figure 9 as follows. The signal “sign_inverse” is true means that the original sign (

W 3 [4]

) is inverted when

W 3 [3 : 0] = = 4

’b1111.

sign_inverse	=	$(W 2 [4] = = 1)$					Case 2
	\|	$(W 2 = = 5$ ’b $01111)$	&	$(W 1 [4] = = 1)$			Case 1
	\|	$(W 2 = = 5$ ’b $01111)$	&	$(W 1 = = 5$ ’b $01111)$	&	$(W 0 [4] = = 1)$	Case 0

The sign bit is a 1 if

P = C + S

is negative; a 0 otherwise. When

W 3 [4]

is a 0 and “sign_inverse” is true, the sign will be 1 at the condition of

W 3 [3 : 0] = = 4

’b1111. Similarly, when

W 3 [4]

is a 1 and “sign_inverse” is true, the sign will be 0 at the condition of

W 3 [3 : 0] = = 4

’b1111. Thus we have the following expression for determining the “csasign”.

\begin{matrix} csasign & = & W 3 [4] & \oplus & (sign_inverse & & & (W 3 [3 : 0] & = = & 4^{'} b 1111)) \end{matrix}

(26)

The symbol ⊕ denotes an eXclusive OR (XOR) operation. When

W 3 [3 : 0] \neq 4

’b1111, the sign equals

W 3 [4]

, regardless any case for

W 2

W 1

, or

W 0

. Note that

W 2

W 1

, and

W 0

are sums of two 4-bit values c and s, so they do not have the pattern 5’b11111. The maximum sum is 5’b11110 when both c and s have a maximum value of 4’b1111.

In the example described above, the window size is 4. We can let the window size be 2 or 8. Then we can use eight 2-bit adders or two 8-bit adders to determine the sign of

P = C + S

. The former has a shorter latency introduced by adders, but the logic equation for determining the sign is more complicated. On the other hand, the latter has a simpler logic equation but the adder has a longer latency to determine the sign.

For a 256-bit ECC, there are more options for the number of adder bits and the number of windows. In order to design a low-cost and high-performance circuit, we instigate the cost (the number of ALMs) and performance (the circuit frequency) of the sign detection circuit for using CSAs in 256-bit ECC on FPGA (Field-programmable gate array) chip. There are 7 configurations. The experimental results are shown in Table 2 and Figure 10.

In general, decreasing CPA bits increases frequency. However, in our simulations, the frequency of the last configuration (128 windows and 2-bit adder) is lower than the configuration with 64 windows and 4-bit adder. This is because the logic equation for sign detection becomes complicated. We can also see that larger adders require more ALMs.

Based on the experimental results, our 256-bit ECC implementations use 4-bit adders, so there are 64 windows (

4 \times 64 = 256

). A frequency of 392.46 MHz is measured when the FPGA chip implements only the sign detection circuitry. Implementing both the sign detection circuit and the CSAs results in a frequency of 343.76 MHz, lower than 392.46 MHz. The latency of the CSAs is the same as that of a 1-bit full adder. Clearly, it is less than that of the sign detection circuit. These experimental results imply that the frequency decreases as the circuit gets larger.

Note that the sign detection circuit itself is a combinational circuit. If we want to test its latency we can add registers on both the input and output sides. These registers are for testing purposes only and should be removed for ECC implementations. Otherwise, there will be a delay of two clock cycles, resulting in incorrect timing.

4. Hardware Implementations of Modular Multiplications and ECC

We have implemented modular multiplication algorithms and ECC in Verilog HDL. Unlike sequential program code in software, hardware modules can operate simultaneously. Considering data dependency, we must handle the synchronization between hardware modules using signals such as “start” and “ready”. This section describes these implementations and evaluates their cost and performance.

4.1. Hardware Implementations of Modular Multiplications

We have implemented the five modular multiplication algorithms described in the previous section. Since we use the 256-bit key proposed in Secp256k1 [4], the Verilog HDL code for modular multiplication also uses 256-bits. As an example of Verilog HDL code shown in Appendix A, the circuit calculates

p = a b \mod m

, where m is a 256-bit odd number and

{a, b} < m

, as described in Algorithm 8 (SSMM). “start” is a one clock cycle active signal that tells the module to start modular multiplication. “ready”, “busy”, and “ready0” are synchronization signals with other modules. The simulation waveform generated with ModelSim can be found in Figure A1. It is easy to develop Verilog HDL code for other modular multiplication algorithms by referring to the SSMM example code.

Table 3 gives the cost-performance of the five modular multiplication algorithms. The column of clock cycles shows the required number of clock cycles when executing the modular multiplication algorithm. The column of frequency (MHz) shows the frequency in MHz at which the circuit can work. The column of latency (

μ

s) shows the time in microsecond, that is calculated by dividing the clock cycles by the clock frequency. The column of ALMs shows the required number of adaptive logic modules. And the column of registers shows the required number of flip-flops.

Figure 11 plots the relative cost-performance of the modular multiplication algorithms to that of IMM. From this figure and table, we can see that SSMMCSA provides the highest frequency and lowest latency at the highest hardware cost.

4.2. Hardware Implementations of ECC

We have implemented ECC in affine, projective, and Jacobian coordinates using the IMM, SSMM, SSMMPRE, and SSMMCSA modular multiplication algorithms.

MMM is very efficient for modular exponentiation as the basic operation for RSA cryptography because it does not require modular (division). The modular exponentiation can be calculated by repeatedly calling MMM. However, transformations to the Montgomery domain are required before the calculations, and a transformation back to the normal domain is also required to obtain the final result. Domain transformation requires the value

q = R^{2} \mod m

which is calculated by costly modular operations. ECC in projective or Jacobian coordinates also does not require modular (division) during scalar point multiplication. Getting the key requires a final division to convert the point to affine coordinates. For these reasons, ECC using MMM is not implemented in this work.

Referring to Algorithm 5, in affine coordinates, the module of scalar point multiplication (scalarmult) invokes the module of point addition (addpoints) and the module of point doubling (doublepoint), as follows (simplified).

module scalarmult (clk, rst_n, start, x, y, d, m, a, rx, ry, ready);

addpointsap(clk,rst_n,start_ap,x1,y1,x2,y2,m,a,apx,apy,ready_ap);

doublepointdp(clk,rst_n,start_dp,x1,y1,m,a,dpx,dpy,ready_dp);

endmodule

Based on the start signal of module scalarmult, we can generate the start signals of start_ap for addpoints and start_dp for doublepoint. An important synchronization is to allow addpoints to start only after the previous iteration’s doublepoint has finished. After 256 iterations, the addition point [apx, apy] is the result point [rx, ry]. The modules of scalar point multiplication in projective and Jacobian coordinates invoke the point addition and doubling modules in projective and Jacobian coordinates, respectively.

In addpoints and doublepoint modules, the result point is calculated based on the algorithms of point addition and point doubling, as described in Algorithm 1 (or 3) and Algorithm 2 (or 4). These modules invoke

(1) modadd (modular addition),

(2) modsub (modular subtraction),

(3) modmul (modular multiplication), and

(4) modinv (modular inversion).

The first two modules are combinational circuits which take one clock cycle. The last two modules are sequential circuits for which we must generate the start signals. The Verilog HDL source codes of these two modules are given in the appendices Appendix A and Appendix B.

Referring to Table 1, in affine coordinates, x of the

Q_{a b}

and

Q_{b a}

can be used as the shared secret key for two parties. In projective and Jacobian coordinates, x must be calculated once by

X / Z

and

X / Z^{2}

respectively. Such calculations can be performed using modmul and modinv as shown in the appendices Appendix A and Appendix B. Table 4 gives the cost-performance of the ECC in affine, projective, and Jacobian coordinates using the IMM, SSMM, SSMMPRE, and SSMMCSA modular multiplication algorithms.

The relative cost-performance of the ECC in affine, projective, and Jacobian coordinates are plotted in the Figure 12, Figure 13 and Figure 14, where the cost-performance of IMM is set to 1.0.

As shown in Table 3, the circuit SSMMCSA has the highest frequency, but applying it to the ECC design does not achieve a higher frequency than the other circuits, such as SSMM. This is because, in ECC design, there are other modules that have longer latency than CSAs and sign detection. For example, modadd takes one clock cycle to perform an addition on two operands a and b, a subtraction (subtracting the modulus m from the sum), and a selection using a multiplexer based on the sign of the subtraction result. Such calculations take longer time than the CSAs and sign detection operations. Meanwhile, SSMMCSA circuit is larger than SSMM, this also decreases the frequency.

Figure 15 shows the relative latency to IMM for ECC implementations. All SSMM-related implementations have lower latency than IMM. We can see that the SSMM in projective coordinates has the lowest latency. This is because, SSMMPRE and SSMMCSA use a lot of hardware resources, which reduces the frequency. Interestingly, the latency in projective coordinates is lower than that in Jacobian coordinates. The formulas for point addition and point doubling in Jacobian coordinates look simpler than those in projective coordinates, but from Formula 24, the calculation of

Y_{r}

depends on

X_{r}

, resulting in a sequential execution. On the other hand,

Y_{r}

and

X_{r}

in projective coordinates can be calculated in parallel (Formula 21 and Formula 20).

The ECC implementations presented in this paper are based on the Secp256k1 curve. It is easy to use the NIST Secp256r1 (P-256) curve [4,11]. Our experimental results show that the cost performance of ECC implementations based on both curves is almost the same.

5. Concluding Remarks

This paper introduced ECC modular multiplication algorithms and their hardware implementations. Experimental results show that SSMMCSA (Shift-sub modular multiplication with CSAs and sign detection) has lower latency than SSMM (Shift-sub modular multiplication). We also investigated ECC implementations in affine, projective, and Jacobian coordinates using the IMM (Interleaved modular multiplication), SSMM, SSMMPRE (Shift-sub modular multiplication with advance preparation) , and SSMMCSA algorithms. Experimental results show that ECC in projective coordinates with SSMM has the lowest latency among all implementations.

The SSMMCSA circuit itself has a higher frequency than SSMM, but the ECC circuits using SSMMCSA have the same or even lower frequency than that using SSMM. The next challenge is to make the SSMMCSA ECC circuit faster than the SSMM ECC circuit.

Appendix A. Verilog HDL Code of Shift-Sub Modular Multiplication (SSMM)

This code implements Algorithm 8. Its block diagram is shown in Figure 6. The signal “start” is active for one clock cycle to tell the module to start modular multiplication. The signal “ready” remains active for one clock cycle to indicate that the modular multiplication result is available. “ready0” remains active until the multiplier is cleared. This signal is used to confirm the readiness of multiple modules to initiate another module’s operation. Figure A1 shows the simulation waveform generated with ModelSim.

Figure A1. Waveform of SSMM that calculates

p = a b \mod m

. The Verilog HDL code is given in Appendix A.

Figure A1. Waveform of SSMM that calculates

p = a b \mod m

. The Verilog HDL code is given in Appendix A.

Appendix B. Verilog HDL Code of Modular Inversion

This code is developed based on Algorithm 2.22 in [3]. The signal “start” is active for one clock cycle to tell the module to start modular inversion. The signal “ready” remains active for one clock cycle to indicate that the modular inversion result is available. “ready0” remains active until the multiplier is cleared. This signal is used to confirm the readiness of multiple modules to initiate another module’s operation. The testbench is listed below.

Figure B1 shows the simulation waveform generated with ModelSim.

Figure B1. Waveform of modular inversion that calculates

c = b a^{- 1} \mod m

. The Verilog HDL code is given in Appendix B.

Figure B1. Waveform of modular inversion that calculates

c = b a^{- 1} \mod m

. The Verilog HDL code is given in Appendix B.

References

Koblitz, N. Elliptic curve cryptosystems. Mathematics of Computation 1987, 48, 203–209. https://www.ams.org/journals/mcom/1987-48-177/S0025-5718-1987-0866109-5/S0025-5718-1987-0866109-5.pdf.
Miller, V.S. Use of Elliptic Curves in Cryptography. In Proceedings of the Advances in Cryptology — CRYPTO ’85 Proceedings, Berlin, Heidelberg, 1986; pp. 417–426. https://link.springer.com/content/pdf/10.1007/3-540-39799-X_31.pdf?pdf=inline%20link.
Hankerson, D.; Menezes, A.; Vanstone, S. Guide to Elliptic Curve Cryptography; Springer-Verlag: New York Inc, 2004. [CrossRef]
Certicom Corp. Standards for Efficient Cryptography. SEC 2: Recommended Elliptic Curve Domain Parameters; http://www.secg.org/sec2-v2.pdf, 2010.
Barker, E.; Chen, L.; Roginsky, A.; Vassilev, A.; Davis, R. SP 800-56A Rev. 3, Recommendation for Pair-Wise Key-Establishment Schemes Using Discrete Logarithm Cryptography; National Institute of Standards and Technology, 2018. https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-56Ar3.pdf.
Blakely, G.R. A Computer Algorithm for Calculating the Product AB Modulo M. IEEE Transactions on Computers 1983, C-32, 497–500. [CrossRef]
Montgomery, P.L. Modular Multiplication Without Trial Division. Mathematics of Computation 1985, 44, 519–521. https://www.ams.org/journals/mcom/1985-44-170/S0025-5718-1985-0777282-X/S0025-5718-1985-0777282-X.pdf.
Li, Y.; Chu, W. Shift-Sub Modular Multiplication Algorithm and Hardware Implementation for RSA Cryptography. In Proceedings of the 17th International Conference on Information Assurance and Security, Lecture Notes in Networks and Systems, Cham, 2021; pp. 541–552. [CrossRef]
Li, Y.; Chu, W. Verilog HDL Implementation for an RSA Cryptography using Shift-Sub Modular Multiplication Algorithm. Journal of Information Assurance and Security 2022, 17, 113–121. http://www.mirlabs.org/jias/secured/Volume17-Issue3/Paper11.pdf.
Bunimov, V.; Schimmler, M. Area and time efficient modular multiplication of large integers. In Proceedings of the Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors, 2003, pp. 400–409. [CrossRef]
Chen, L.; Moody, D.; Regenscheid, A.; Robinson, A.; Randall, K. Recommendations for Discrete Logarithm-based Cryptography: Elliptic Curve Domain Parameters. NIST Special Publication NIST SP 800-186, 2023. https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-186.pdf. [CrossRef]

Figure 1. ECDH and ECC algorithms. The ECDH Key Exchange invokes a Scalar Point Multiplication that uses two computations - Point Addition and Point Doubling. Four primitive modular calculations (Addition, Subtraction, Multiplication, and Inversion) are used for these two computations.

Figure 2. Point addition

R = P + Q

on an elliptic curve

y^{2} = x^{3} + a x + b

in the real number field.

Figure 2. Point addition

R = P + Q

on an elliptic curve

y^{2} = x^{3} + a x + b

in the real number field.

Figure 3. Point doubling

R = 2 P

on an elliptic curve

y^{2} = x^{3} + a x + b

in the real number field.

Figure 3. Point doubling

R = 2 P

on an elliptic curve

y^{2} = x^{3} + a x + b

in the real number field.

Figure 4. Block diagram of interleaved modular multiplication (IMM). It implements Algorithm 6.

Figure 5. Block diagram of Montgomery modular multiplication (MMM). It implements Algorithm 7.

Figure 6. Block diagram of shift-sub modular multiplication (SSMM). It implements Algorithm 8.

Figure 7. Block diagram of shift-sub modular multiplication with advance preparation (SSMMPRE). It implements Algorithm 9.

Figure 8. Block diagram of shift-sub modular multiplication with CSAs and sign detection (SSMMCSA). It implements Algorithm 10.

Figure 9. Determining the sign of

P = C + S

based on the 17-bit outputs C and S of CSAs. We divide 17 bits to four windows. There are five bits in the left-most window. Each of other three windows has four bits. The C and S are inputs; others are outputs of adders.

W 3 [4]

indicates the sign except for Case 0, Case 1, or Case 2, where the sign is the inverse of

W 3 [4]

when

W 3 [3 : 0] = = 4

’b1111.

Figure 9. Determining the sign of

P = C + S

W 3 [4]

indicates the sign except for Case 0, Case 1, or Case 2, where the sign is the inverse of

W 3 [4]

when

W 3 [3 : 0] = = 4

’b1111.

Figure 10. Cost and performance of sign detection with different configurations.

Figure 11. Cost performance comparison of modular multiplication algorithms.

Figure 12. Cost performance comparison of ECC in affine coordinates.

Figure 13. Cost performance comparison of ECC in projective coordinates.

Figure 14. Cost performance comparison of ECC in Jacobian coordinates.

Figure 15. Relative latency of ECC. The SSMM in projective coordinates has the lowest latency.

Table 1. Elliptic curve Diffie-Hellman key exchange.

Expose an elliptic curve $y^{2} = x^{3} + a x + b \mod m$ and a point P on the elliptic curve to the world
Alice	Bob
Generate a secret $d_{a}$	Generate a secret $d_{b}$
Calculate $Q_{a} = d_{a} P$ (Algorithm 5)	Calculate $Q_{b} = d_{b} P$ (Algorithm 5)
Expose $Q_{a}$	Expose $Q_{b}$
Get $Q_{b}$ from Bob	Get $Q_{a}$ from Alice
Calculate $Q_{a b} = d_{a} Q_{b}$ (Algorithm 5)	Calculate $Q_{b a} = d_{b} Q_{a}$ (Algorithm 5)
Use x of $Q_{a b}$ as the key	Use x of $Q_{b a}$ as the key

Table 2. Cost and performance of sign detection with different configurations. The FPGA chip device is Cyclone V / 5CGXFC7D7F31C8.

Windows × Adder bits	$2 \times 128$	$4 \times 64$	$8 \times 32$	$16 \times 16$	$32 \times 8$	$64 \times 4$	$128 \times 2$
ALMs	129	63	33	19	9	7	3
Frequency (MHz)	205.30	248.57	282.09	326.90	342.58	392.46	355.75

Table 3. Comparison of modular multiplication algorithms. SSMM is 1.76 times faster than IMM and SSMMCSA is 3.21 times faster than IMM. The FPGA chip device is Cyclone V / 5CGXFC7D7F31C8.

Algorithm	Clock cycles	Frequency (MHz)	Latency ( $μ$ s)	ALMs	Registers
IMM	258	35.15	7.34	657	268
MMM	258	59.12	4.36	756	276
SSMM	258	61.92	4.17	606	527
SSMMPRE	258	66.16	3.90	847	1043
SSMMCSA	259	113.29	2.29	1119	1043

Table 4. Comparison of ECC in different coordinates with different modular multiplication algorithms. The FPGA chip device is Cyclone V / 5CGXFC7D7F31C8.

Algorithm	Clock cycles	Frequency (MHz)	Latency (ms)	ALMs	Registers
ECC in Affine Coordinates
IMM	402,146	13.98	28.77	14,816	6,995
SSMM	402,146	20.41	19.70	15,032	8,085
SSMMPRE	402,146	20.69	19.44	16,723	12,743
SSMMCSA	403,166	20.40	19.76	18,416	12,834
ECC in Projective Coordinates
IMM	396,550	17.28	22.95	30,291	15,352
SSMM	396,550	29.49	13.45	29,878	23,434
SSMMPRE	396,550	28.54	13.89	37,545	41,902
SSMMCSA	398,080	21.31	18.68	46,831	47,984
ECC in Jacobian Coordinates
IMM	369,079	14.31	25.79	27,884	14,068
SSMM	369,079	21.15	17.45	28,978	21,708
SSMMPRE	369,079	20.95	17.62	35,864	39,024
SSMMCSA	370,502	17.14	21.62	44,591	45,298

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

MDPI Initiatives

Important Links

Choose an area of interest and we will send you notifications of new preprints at your preferred frequency.

Disclaimer

Hardware Implementations of Elliptic Curve Cryptography Using Shift-Sub Based Modular Multiplication Algorithms

Abstract

1. Introduction

2. Elliptic Curve Cryptography Algorithms

2.1. ECC Point Addition and Doubling in Affine Coordinates

2.1.1. ECC Point Addition in Affine Coordinates

2.1.2. ECC Point Doubling in Affine Coordinates

2.2. ECC Point Addition and Doubling in Projective Coordinates

2.3. ECC Point Addition and Doubling in Jacobian Coordinates

2.4. Modular Inversion

2.5. Scalar Point Multiplication

2.6. Elliptic Curve Diffie-Hellman Key Exchange

3. Modular Multiplication Algorithms

3.1. Interleaved Modular Multiplication Algorithm

3.2. Montgomery Modular Multiplication Algorithm

3.3. Shift-Sub Modular Multiplication Algorithm

3.4. Shift-Sub Modular Multiplication with Advance Preparation Algorithm

3.5. Shift-Sub Modular Multiplication with CSAs and Sign Detection Algorithm

4. Hardware Implementations of Modular Multiplications and ECC

4.1. Hardware Implementations of Modular Multiplications

4.2. Hardware Implementations of ECC

5. Concluding Remarks

Appendix A. Verilog HDL Code of Shift-Sub Modular Multiplication (SSMM)

Appendix B. Verilog HDL Code of Modular Inversion

References

MDPI Initiatives

Important Links

Subscribe