This paper presents an efficient crypto processor architecture for key agreement using ECDH (Elliptic-curve Diffie Hellman) protocol over

In recent years, the demand for secure transmission of reliable information over networks has been increasing dramatically. Various applications such as healthcare [

The security strength of ECC relies on the difficulty of solving the discrete logarithm problem [

There are different settings in the literature for implementing ECC protocols. These settings involve the choice of (i) one of two bases (polynomial or Gaussian normal) for representing the initial and final points and (ii) one of two coordinate systems (affine or projective). The polynomial basis is considered in this paper because it enables faster modular multiplications, whereas a Gaussian normal basis is valuable where frequent squarings must be computed [

Several hardware architectures have been published in the literature to accelerate ECC operations [

In [

In [

In [

There are very few hardware designs in which key agreement using Elliptic-curve Diffie Hellman is optimized for FPGA [

The related implementations, reported in [

Our contributions and employed strategies to reduce area and power values are as follows:

The RTL (register transfer level) descriptions of the proposed crypto processor designs (DESIGN-I and DESIGN-II) are written in Verilog HDL. The results are reported for a Xilinx Virtex-7 (xc7vx690t-3ffg1930) FPGA device. For DESIGN-I, the utilized Slices, LUTs (look-up-tables) and FFs (flip flops) over

The rest of this paper is organized as follows: Section 2 presents the essential background for the computation of the ECDH scheme and the PM operation. The proposed crypto processor architecture for generating a public key (DESIGN-I) and a shared secret (DESIGN-II) is provided in Section 3. The implementation results are presented in Section 4. Finally, Section 5 concludes the paper.

The mathematical structure of the ECDH protocol is described in Section 2.1. The sequence of instructions for the computation of the Elliptic-curve PM and the corresponding point addition and doubling formulas are provided in Section 2.2.

A complete layout of the ECDH protocol for key agreement (or shared key generation) between two distinct users is illustrated in

As shown in

The PM operation is the computation of

In

The inputs to Algorithm 1 are initial point
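To make the key-agreement flow concrete, the following is a minimal software sketch of an x-coordinate-only Montgomery ladder and an ECDH exchange built on it. It is not the paper's Algorithm 1: it skips the recovery of the y-coordinate and uses a plain Fermat inversion. The reduction polynomial (the NIST polynomial for the 163-bit binary field), the curve coefficient `B = 1`, the starting x-coordinate `X_BASE` and the two private scalars are all assumptions chosen for illustration.

```python
# Sketch only: x-coordinate Montgomery ladder over GF(2^163) and an ECDH exchange.
M = 163
# Assumed NIST reduction polynomial x^163 + x^7 + x^6 + x^3 + 1.
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1
B = 1  # assumed curve coefficient b; not taken from the paper


def gf_mul(a, b):
    """Carry-less multiply followed by reduction modulo POLY."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        a <<= 1
        b >>= 1
    while acc.bit_length() > M:
        acc ^= POLY << (acc.bit_length() - 1 - M)
    return acc


def gf_sq(a):
    return gf_mul(a, a)


def gf_inv(a):
    """Fermat inversion: a^(2^M - 2) as the product of a^(2^i), i = 1..M-1."""
    result, t = 1, a
    for _ in range(M - 1):
        t = gf_sq(t)
        result = gf_mul(result, t)
    return result


def m_add(X1, Z1, X2, Z2, x):
    """Differential addition; x is the affine x of the point difference."""
    a = gf_mul(X1, Z2)
    b = gf_mul(X2, Z1)
    Z3 = gf_sq(a ^ b)
    return gf_mul(x, Z3) ^ gf_mul(a, b), Z3


def m_double(X, Z):
    """Projective doubling: X' = X^4 + b*Z^4, Z' = X^2 * Z^2."""
    X2, Z2 = gf_sq(X), gf_sq(Z)
    return gf_sq(X2) ^ gf_mul(B, gf_sq(Z2)), gf_mul(X2, Z2)


def ladder_x(k, x):
    """Return the affine x-coordinate of k*P given x = x(P), for k >= 1."""
    X1, Z1 = x, 1
    X2, Z2 = m_double(x, 1)
    for bit in bin(k)[3:]:  # scan the key bits below the leading one
        if bit == "1":
            X1, Z1 = m_add(X1, Z1, X2, Z2, x)
            X2, Z2 = m_double(X2, Z2)
        else:
            X2, Z2 = m_add(X1, Z1, X2, Z2, x)
            X1, Z1 = m_double(X1, Z1)
    return gf_mul(X1, gf_inv(Z1))


# Hypothetical inputs: an arbitrary nonzero field element and two private keys.
X_BASE = 0x0123456789ABCDEF0123456789ABCDEF012345678
d_a, d_b = 0x1F2E3D4C5B6A7988, 0x0A1B2C3D4E5F6071

pub_a = ladder_x(d_a, X_BASE)   # public key generation, user A
pub_b = ladder_x(d_b, X_BASE)   # public key generation, user B
shared_a = ladder_x(d_a, pub_b)  # shared key generation, user A
shared_b = ladder_x(d_b, pub_a)  # shared key generation, user B
```

Both users run the same ladder twice, once on the common starting point and once on the other party's public value, which is why a shared-key computation costs roughly two PM operations.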

The proposed crypto processor architecture for ECDH is shown in

The ECDH controller is responsible for generating control signals for the routing multiplexers

The ECDH controller is implemented as a finite state machine (FSM) with three states: IDLE, PKG and SKG. IDLE is the initial state of the crypto processor and indicates that no operation is in progress. PKG indicates that the crypto processor is computing a public key. Finally, SKG indicates that it is computing a shared key. A two-bit input signal “
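The three-state controller can be modelled in a few lines of software. The sketch below is illustrative only: the encoding of the two-bit input (`0b01` for public-key generation, `0b10` for shared-key generation) and the `done` flag that returns the machine to IDLE are hypothetical choices, not details taken from the design.

```python
from enum import Enum


class State(Enum):
    IDLE = 0  # no operation in progress
    PKG = 1   # public key generation
    SKG = 2   # shared key generation


def next_state(state, op, done):
    """Return the next controller state.

    op   : two-bit request (hypothetical encoding: 0b01 -> PKG, 0b10 -> SKG)
    done : hypothetical flag raised when the running operation has finished
    """
    if state is State.IDLE:
        if op == 0b01:
            return State.PKG
        if op == 0b10:
            return State.SKG
        return State.IDLE
    # While PKG or SKG is running, new requests are ignored until completion.
    return State.IDLE if done else state
```

A typical trace under these assumptions is IDLE, PKG, PKG, ..., IDLE for a public-key request, followed by IDLE, SKG, SKG, ..., IDLE for the shared key.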

The purpose of

The ECPM unit consists of (i) an array of registers, (ii) three routing multiplexers, (iii) adder, squarer, multiplier, polynomial reduction and inversion units and (iv) an ECPM controller. These units are described in the following subsections.

The register array in the proposed crypto processor architecture is an

The three routing multiplexers in the proposed architecture include

The arithmetic operators, i.e., addition (ADDER), squaring (SQUARE) and multiplication (MULT), perform the PM operation of the ECPM unit. The implementation of a polynomial addition in

The efficiency of the polynomial multiplier determines the performance of an ECC-based crypto processor architecture. According to the literature, there are four ways to implement a multiplier circuit: (i) bit-serial, (ii) digit-serial, (iii) bit-parallel and (iv) digit-parallel. Each choice offers a different trade-off between area and computation time. When we consider only the computational cost in terms of clock cycles, the bit-serial multipliers require

The size of the resultant polynomials generated by the MULT and SQUARE units is
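The behaviour of the ADDER, SQUARE and MULT units, including the reduction step, can be sketched in software as follows. This is a behavioural model under stated assumptions rather than the RTL itself: polynomials are stored as Python integers with bit i holding the coefficient of x^i, and the reduction polynomial is assumed to be the NIST polynomial for the 163-bit binary field.

```python
M = 163
# Assumed NIST reduction polynomial x^163 + x^7 + x^6 + x^3 + 1.
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def gf_add(a, b):
    """Polynomial addition over GF(2) is a bitwise XOR."""
    return a ^ b


def gf_reduce(c):
    """Reduce a polynomial of up to 2M-1 bits back to M bits."""
    while c.bit_length() > M:
        c ^= POLY << (c.bit_length() - 1 - M)
    return c


def gf_mul_bit_serial(a, b):
    """Shift-and-add multiplier: one bit of b per loop iteration.

    The loop body runs M times, mirroring a bit-serial datapath, and the
    reduction is interleaved so the accumulator never exceeds M bits.
    Both operands must already be reduced (fewer than M + 1 bits).
    """
    acc = 0
    for i in reversed(range(M)):
        acc <<= 1
        if (acc >> M) & 1:
            acc ^= POLY
        if (b >> i) & 1:
            acc ^= a
    return acc


def gf_square(a):
    """Squaring spreads the bits (a zero between each pair), then reduces."""
    spread, i = 0, 0
    while a:
        if a & 1:
            spread |= 1 << (2 * i)
        a >>= 1
        i += 1
    return gf_reduce(spread)
```

The cheap squaring (bit spreading plus reduction, no partial products) is consistent with the small SQUARE unit and the much larger MULT unit reported in the area breakdown.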

As shown in Algorithm 1, the reconversion from projective to affine coordinates requires two polynomial inversions. There are several methods in the literature to compute the multiplicative inverse of a polynomial. Among them, the Itoh-Tsujii algorithm is widely employed in the state-of-the-art because it needs only multiplications and squarings [
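An Itoh-Tsujii style inversion can be sketched as below. The addition chain used here simply follows the binary expansion of m - 1 = 162, which is one common way to schedule the algorithm and is not necessarily the schedule used in the proposed hardware. The reduction polynomial is again assumed to be the NIST polynomial for the 163-bit binary field, and the helper names are illustrative.

```python
M = 163
# Assumed NIST reduction polynomial x^163 + x^7 + x^6 + x^3 + 1.
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def gf_mul(a, b):
    """Carry-less multiply followed by reduction modulo POLY."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        a <<= 1
        b >>= 1
    while acc.bit_length() > M:
        acc ^= POLY << (acc.bit_length() - 1 - M)
    return acc


def gf_sq(a):
    return gf_mul(a, a)


def itoh_tsujii_inverse(a):
    """Return (a^-1, multiplications, squarings) for nonzero a.

    Invariant: beta = a^(2^k - 1). The inverse is a^(2^M - 2),
    i.e. the square of beta once k reaches M - 1.
    """
    if a == 0:
        raise ZeroDivisionError("zero has no inverse")
    beta, k = a, 1
    mults = squares = 0
    for bit in bin(M - 1)[3:]:  # bits of M - 1 below the leading one
        t = beta
        for _ in range(k):       # t = beta^(2^k)
            t = gf_sq(t)
            squares += 1
        beta = gf_mul(t, beta)   # now beta = a^(2^(2k) - 1)
        mults += 1
        k *= 2
        if bit == "1":
            beta = gf_mul(gf_sq(beta), a)  # now beta = a^(2^(k+1) - 1)
            mults += 1
            squares += 1
            k += 1
    squares += 1
    return gf_sq(beta), mults, squares
```

With this schedule the inversion is dominated by squarings, which are cheap in a polynomial basis, and needs only a handful of full multiplications.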

The FSM-based ECPM controller is responsible for performing the PM operation of

The proposed crypto processor will remain in state 0 (IDLE state).

It starts the computation once it receives the

Under different conditions, when the

For the proposed DESIGN-I,

In

As shown in

The implementation results and the comparison to the state-of-the-art are given in Section 4.1 and Section 4.2, respectively.

Our DESIGN-I and DESIGN-II over

The area breakdown for the building blocks in DESIGN-I and DESIGN-II is shown in

Building Blocks | Used Slices | Used LUTs | Used FFs |
---|---|---|---|
ADDER | 77 | 82 | 0 |
SQUARE + NIST Reduction | 38 | 39 | 0 |
MULT (shift and add) + NIST Reduction | 4429 | 10531 | 0 |
Register array | 652 | 1956 | 1304 |
Sum of resources | 5196 | 12608 | 1304 |

As shown in

Concerning the sum of resources of building blocks of our DESIGN-I and DESIGN-II, apart from the routing multiplexers, the reported Slices (5196) and LUTs (12608) values in

The detailed implementation results of our proposed DESIGN-I and DESIGN-II are shown in

Design | Area: Slices | Area: LUTs | Area: FFs | Timing: CCs | Timing: Frequency (in MHz) | Timing: PM latency (in µs) | Timing: ECDH latency (in µs) | Total Power |
---|---|---|---|---|---|---|---|---|
DESIGN-I | 3983 | 10639 | 2389 | 163902 | 296 | 553.7 | – | 29 |
DESIGN-II | 4037 | 11713 | 2427 | 327804 | 280 | 553.7 | 1170.7 | 57 |
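The latency figures in the table follow from dividing the clock-cycle count by the clock frequency. The short check below reproduces them, assuming the frequency column is in MHz so that the quotient is in microseconds; the function name is illustrative.

```python
def latency_us(clock_cycles, frequency_mhz):
    """Latency in microseconds = clock cycles / frequency in MHz."""
    return clock_cycles / frequency_mhz


# DESIGN-I: one PM operation (public key generation).
pm_latency = round(latency_us(163902, 296), 1)

# DESIGN-II: two PM operations (public key plus shared key).
ecdh_latency = round(latency_us(327804, 280), 1)
```

Note that 327804 is exactly twice 163902, matching the two PM operations needed to produce a shared key.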

As shown in

The total number of Slices and LUTs of the building blocks is 1.30 (ratio of 5196 to 3983) and 1.18 (ratio of 12608 to 10639) times higher than the Slices and LUTs of our DESIGN-I. Similarly, when the area resources of DESIGN-II are considered, the total number of Slices and LUTs of the building blocks is 1.28 (ratio of 5196 to 4037) and 1.07 (ratio of 12608 to 11713) times higher. The complete designs therefore use fewer resources than the sum of their separately synthesized building blocks. The reason is that we set area optimization as the synthesis goal in the Vivado IDE.
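The sums and ratios quoted above can be reproduced from the reported per-block and per-design figures. In the sketch below the ratios are truncated, not rounded, to two decimals, since that is what matches the quoted values; the dictionary keys are shorthand labels for illustration.

```python
import math

# (Slices, LUTs) per building block, as reported in the area breakdown.
blocks = {
    "ADDER": (77, 82),
    "SQUARE": (38, 39),
    "MULT": (4429, 10531),
    "REGISTERS": (652, 1956),
}
total_slices = sum(s for s, _ in blocks.values())
total_luts = sum(l for _, l in blocks.values())

# (Slices, LUTs) of the complete designs after synthesis.
designs = {"DESIGN-I": (3983, 10639), "DESIGN-II": (4037, 11713)}


def ratio(numerator, denominator):
    """Ratio truncated to two decimal places."""
    return math.floor(100 * numerator / denominator) / 100


ratios = {
    name: (ratio(total_slices, slices), ratio(total_luts, luts))
    for name, (slices, luts) in designs.items()
}
```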

As far as the utilized area is concerned, DESIGN-I uses only 3983 Slices, 10639 LUTs and 2389 FFs. DESIGN-II uses 4037 Slices, 11713 LUTs and 2427 FFs, which are 1.01 (ratio of 4037 to 3983), 1.10 (ratio of 11713 to 10639) and 1.01 (ratio of 2427 to 2389) times higher than our DESIGN-I. The reason for higher hardware resources in DESIGN-II is the use of two additional routing multiplexers (

The comparison with the most compatible state-of-the-art architectures, published in [

Ref # | PM Algorithm | Device | FPGA Slices | Frequency (in MHz) | Latency |
---|---|---|---|---|---|
Architectures for PM computation of ECC | | | | | |
[ | Montgomery | Virtex-7 | 2207 | 369 | 10.73 |
[ | Double and Add | Virtex-5 | 3122 | 288 | 24.5 |
[ | Montgomery | Virtex-7 | 3657 | – | 25.3 |
[ | Montgomery | Virtex-5 | 473 | 359 | 110 |
[ | Montgomery | Virtex-7 | 4150 | 352 | 3.18 |
DESIGN-I | Montgomery | Virtex-5 | 3126 | 301 | 544.5 |
DESIGN-II | Montgomery | Virtex-7 | 3983 | 296 | 553.7 |
Hardware designs for ECDH scheme | | | | | |
[ | Montgomery | Virtex-5 | 35102 LUTs | 125 | 8.88 |
[ | Montgomery | Virtex-7 | 1809 | 62.5 | 4.13 |
DESIGN-I | Montgomery | Virtex-5 | 11713 LUTs | 289 | 1.13 |
DESIGN-II | Montgomery | Virtex-7 | 4037 | 280 | 0.58 |

Notes:

As compared to the dedicated PM architecture of [

As shown in

On Virtex-7 FPGA, a 2-stage pipelined PM architecture of [

As compared to the PM architecture of [

The accelerator architectures of [

This paper has presented a key-agreement architecture for the ECDH scheme over

We extend our gratitude to the Estonian Aviation Academy, Tartu, Estonia for supporting and funding this research work.

^{163}) for low-area applications on FPGA