Abstract
In this paper we describe a novel approach for
speeding up the computations of characteristic 2
elliptic curve cryptography. Using a projective space
such as the Lopez-Dahab space [1] for representing
point coordinates we accelerate point additions and
point doublings by introducing a novel way for
multiply elements in finite fields of the form GF(2m).
Our technique uses a CPU instruction for carry-less
multiplication (GFMUL) and single iteration
Karatsuba-like formulae [2] for computing the carryless product of large degree polynomials in GF(2). It
then performs the reduction of the carry-less product
of these polynomials by taking into account the fact
that many curves specify fields with irreducible
polynomials which are sparse. For example NIST
curves specify polynomials with either three terms
(trinomials) or five terms (pentanomials). We
demonstrate results from a prototype implementation
showing that our technique speeds up Elliptic Curve
Diffie Hellman based on the NIST B-233 curve by 55%
in software on a 3.6 GHz Pentium 4 processor. If a 3
clock latency GFMUL instruction is introduced to the
CPU then the acceleration factor becomes 5.2X. We
also show that further software optimizations have the
potential to further increase the speedup beyond 10X.
1. Introduction
Elliptic curve cryptography, originally proposed by
Koblitz [3] and Miller [4] has recently gained
significant interest from the industry [5-7] and
government organizations [8] due to its ability to
support key establishment and digital signature
operations like RSA with similar cryptographic
strength, but much smaller key sizes. The main idea
behind elliptic curve cryptography is that points in an
elliptic curve form an additive group. A point can be
added to itself many times where the number of
additions can be a very large number. Such operation
is often called point times scalar multiplication or
simpler point multiplication. The suitability of
265
3. Carry-less Multiplication
Carry-less multiplication, also known as Galois
Field (GF(2)) Multiplication is the operation of
multiplying two numbers without generating or
propagating carries. In the standard integer
multiplication the first operand is shifted as many
times as the positions of bits equal to 1 in the second
operand. The product of the two operands is derived
by adding the shifted versions of the first operand with
each other. In carry-less multiplication the same
procedure is followed except that additions do not
generate or propagate carry. In this way, bit additions
are equivalent to the exclusive OR (XOR) logical
operation.
Carry-less multiplication is formally defined as
follows: let the two operands be A, B, of size n bits
each. Let the number A be the following array of bits
2. Related Work
Efficient implementation of characteristic 2
elliptic curve cryptosystems is a subject that has been
studied extensively by the cryptographic community.
An instruction for carry-less multiplication was first
proposed in [18] for speeding up computations in finite
fields. Other efficient Galois field multiplication
techniques are described in [19-21]. Our work differs
form the state-of-the-art by the fact that Karatsuba
multiplication is performed in a single iteration and the
reduction algorithm avoids the costly word-by-word
processing of the input.
Our single iteration extensions to Karatsuba [15]
are based on novel one-to-one mappings between
products and complete graphs. A related technical
report from the University of Ruhr, Bochum,
Germany by Weimerskirch and Paar [11] describes
a one-iteration technique that needs significantly
more scalar multiplications than ours (i.e.,
n(n+1)/2 multiplications with n being the operand
size). Another paper recently published by Peter
Montgomery [12] describes a brute-force
approach for generating five, six, and seven
Karatsuba-like formulae. These formulae are
generated from a computer program that scans the
space of all possible formulae and returns the
fastest algorithmically correct ones. Last, a paper
by Cetin Koc et. al. from Oregon Sate University
[13] describes a less recursive variant of Karatsuba
where the size of the input operands needs to be a
power of 2.
Our reduction algorithm is also novel and is based
on an extension of the Barrett [16] reduction algorithm
to modulo-2 arithmetic or the Feldmeier CRC
generation algorithm [17] to dividends and divisors of
arbitrary size. It takes into account the fact that the
A = [a n 1 a n 2 ... a 0 ]
(1)
(2)
C = [c2 n 1 c2 n 2 ... c0 ]
(3)
ci = a jbi j
j =0
(4)
ci = a j bi j
j =i n+1
(5)
266
(6)
267
Input Operands
a 3 x 3+ a 2 x 2+ a 1 x + a 0
Output
b 3 x 3+ b 2 x 2+ b 1 x + b 0
Initial
0
where
c 0=a0b 0
c 1=a0b 1+a 1b 0
c 2=a0b 2+a 2b 0+a 1b 1
c 3=a0b 3+a 3b 0+a 2b 1+a1b 2
c 4=a1b 3+a 3b 1+a 2b 2
c 5=a3b 2+a 2b 3
c 6=a3b 3
a0
a1
a2
a3
b0
b1
b2
b3
(a 0 + a 1 ) (b0 + b 1)
(a 2 + a 3 ) (b2 + b 3)
Scalar multiplications
(a 0 + a 2 ) (b0 + b 2)
(a 1 + a 3 ) (b1 + b 3)
(a 0 + a 1 + a 2 + a 3) (b 0 + b 1 + b 2 + b 3)
Subtractions
1
(a 0 + a 1 ) ( b 0 + b 1 )
0
(a 0 + a 2 ) ( b 0 + b 2 )
1
(a 1 + a 3 ) ( b 1 + b 3 )
3
(a 2 + a 3 ) ( b 2 + b 3 )
[a 0b 0 +a 1b 1]
0
2
[a 0b 0 +a 2b 2]
0
1
[a 1b 1 +a 3b 3]
0
2
[a 2b 2 +a 3b 3]
1
a 0b 1 +a 1b 0
2
a 0b 2 +a 2b 0
3
a 3b 1 +a 1b 3
3
a 3b 2 +a 2b 3
(a 0 + a 1 + a2 + a 3) ( b 0 + b 1 + b2 + b 3)
[(a 0 + a 1) (b 0 + b 1) + ( a 2 + a3) (b 2 + b 3) +
(a0 b2 +a 2b 0 + a 1b 3 +a 1b 3)]
268
3
a 0b 3 + a3b 0 + a 1b 2 + a 2b 1
p( x) = c( x) xt mod g ( x)
p( x) = g ( x) q( x) mod xt = Lt ( g ( x) q( x))
Now we define:
g ( x) = g t x t + g * ( x)
(7)
p ( x) = Lt ( g ( x) q( x)) = Lt (q ( x) g * ( x) + q ( x) gt x t )
p( x) = Lt (q( x) g * ( x))
(9) c( x) xt + s = g ( x) q( x) x s + p( x) x s
(8)
(15)
Let
x t + s = g ( x) q + ( x) + p + ( x)
(16)
(9)
(15)
+
+
c( x) g ( x) q ( x ) + c( x ) p ( x)
(16)
c( x) x = g ( x) q ( x) + p ( x)
(14)
c( x) = cs 1x s 1 + cs 2 x s 2 + ... + c1x + c0 ,
(13)
p( x) = c( x) xt mod g ( x) = g ( x) q( x) mod xt
(12)
Here,
c(x) is a polynomial of degree s-1 with
coefficients in GF(2), representing the most
significant bits of the carry-less product.
t is the degree of the polynomial g.
g(x) is the irreducible polynomial of the finite
field used
p ( x ) = pt 1 x t 1 + pt 2 x t 2 + ... + p1 x + p0 ,
and
g ( x) = gt xt + gt 1xt 1 + ... + g1x + g0
(11)
(17)
= g ( x) q( x) x s + p( x) x s
and
(17) M s (c ( x) g ( x) q + ( x) + c ( x) p + ( x ))
(10)
= M s ( g ( x) q ( x) x s + p ( x) x s )
(18)
269
in the left and right hand side of equation (18) are not
affected by the polynomials cp+ and pxs. Hence,
(18) M s (c( x) g ( x) q + ( x))
= M s ( g ( x) q ( x) x s )
WORD_TYPE
WORD_TYPE
WORD_TYPE
WORD_TYPE
WPRD_TYPE
WORD_TYPE
WORD_TYPE
WORD_TYPE
WORD_TYPE
(19)
q = M s (c( x) q + ( x))
(21)
p( x) = Lt ( g * ( x) M s (c( x) q + ( x)))
S_0, S_1, S_2, S_3, S_4, S_5, S_6, S_7, S_8, S_9;
P_0_0, P_0_1, P_1_0, P_1_1;
P_2_0 P_2_1, P_3_0, P_3_1, P_4_0, P_4_1;
P_5_0, P_5_1, P_6_0, P_6_1, P_7_0;
P_7_1, P_8_0, P_8_1;
D_0_0, D_0_1, D_1_0, D_1_1;
L_0, L_1, L_2, L_3;
U_0, U_1, U_2, U_3;
V_0, V_1, V_2, V_3;
(22)
//finally
result[0]
result[1]
result[2]
result[3]
result[3]
return 1;
270
5. Evaluation
In Table 1 we present the performance of the point
times scalar multiplication operation which constitutes
the main part of an Elliptic Curve-based Diffie
Hellman key exchange using the Sun implementation
of OpenSSL, our software improvements, and our
proposed hardware assist. The processor used is a
Pentium 4 processor at 3.6 GHz. As can be seen our
approach offers substantial acceleration which ranges
from 55% to 5.2X depending on whether a GFMUL
instruction is introduced to the CPU or not. Further
optimizations can remove much of the application
level overhead (e.g, OpenSSL BN structure
management and function call overhead) ideally
accelerating the prototype by 66X. Practically, we
believe that a 10X acceleration is possible with further
refinement of the code. Such exercise, however, is left
for future work.
We believe that the proposed approach has some
advantages over existing methods because it is highly
flexible and enables high performance elliptic curve
processing. This can help opening new markets for
companies (e.g., packet by packet public key
cryptography). Furthermore the improved method can
even influence companies to take leadership steps in
the crypto and networking. For example, the proposed
instruction
accelerates
both
Elliptic
Curve
cryptography and the GCM mode of AES. In this way,
high performance public key operations and message
authenticity can be accelerated using the same
hardware assists. Currently used hash family SHA*
scales badly because its state increases with digest
length. This gives strong motivation for companies to
move to AES-based authenticity schemes and for the
penetration of AES-based schemes into the security
products market.
Implementation
OpenSSL
220,000 clocks
1,220,000 clocks
70,000 clocks
1,070,000 clocks
Implementation
Sun/OpenSSL
77%
55%
21X
4.6X
66X
5.2X
6. Conclusion
Acknowledgement
We
presented
a
new
approach
for
implementing Galois field multiplication for
Characteristic 2 Elliptic Curves. The new
ingredients are an efficient reduction method
combined with using a GFMUL instruction,
currently not part of the instruction set of
processors and single iteration extensions to the
Karatsuba multiplication algorithm. Currently,
software implementations do not use Karatsuba and
Barrett because the cost of such algorithms when
implemented the straightforward way. Hardware
approaches found in cryptographic processors
271
References
272