
Power and area efficient ASIC implementation of AES algorithm
K. Vidya Sagar, R. Thilagavathy
Dept. of Electronics & Communication,
National Institute of Technology,
Trichy, India-620015.
Mobile: 9042711438
e-mail: sagar123.k@gmail.com

Abstract: This paper presents a power and area efficient hardware implementation of the Advanced Encryption Standard (AES) algorithm. Different implementations of the AES S-Box are studied and evaluated with respect to timing (i.e., critical path), silicon area, and power consumption in order to choose a suitable S-Box. The architecture uses a split 8-bit data path between key and state processing, with resource sharing of the S-Box operation to minimize area and power consumption. The AES core occupies an area of 4.59 kgates and consumes 143 µW of power at 10 MHz in a 0.18 µm standard-cell CMOS technology.

Keywords: Advanced Encryption Standard (AES); Encryption; Substitute Byte (S-Box); MixColumns; Application Specific Integrated Circuit (ASIC)

I. INTRODUCTION

In today's digital world, encryption is emerging as an integral part of all communication networks and information processing systems, protecting both stored and in-transit data. Encryption is the transformation of plain data (known as plaintext) into unintelligible data (known as ciphertext) through an algorithm referred to as a cipher. Numerous encryption algorithms are in common use, but the National Institute of Standards and Technology (NIST) selected Rijndael and named it the Advanced Encryption Standard (AES) on 26 November 2001, when the Data Encryption Standard (DES) was no longer secure enough for protecting sensitive information.

The AES algorithm is extremely flexible and may be implemented using different hardware architectures, from deeply pipelined loop-unrolled implementations that support many gigabits-per-second throughputs, through iterative single-round implementations, to resource-shared designs, which sacrifice throughput in favour of saving power and area. There is a continued demand for better hardware implementations, and a number of application areas seek even lower-resource designs for block ciphers such as the AES, requiring only modest data rates of less than 1 Mbps. Examples of such resource-critical applications are inductively powered RF identification (RFID), wireless sensor networks (WSNs), and smart cards, which can have clock frequencies, derived from the energizing RF, as low as 100 kHz, power limitations of a few microwatts, latency requirements of a few milliseconds, and an area limitation of a few thousand gates. In this paper a hardware implementation of the AES algorithm is implemented with the goal of reducing area and power consumption for use in resource-critical applications.

Designs aiming for high throughput based on 128-bit data paths have been widely reported, and they provide highly efficient solutions for gigabits-per-second throughputs. Many of the application-specific integrated circuit (ASIC) designs for the AES considered a 32-bit data path as the minimum. A good example is the work of Satoh et al. [13]. However, in the interests of lower power, an 8-bit data path has been explored more recently. In a previous design [7], the authors of [7] explored an application-specific instruction processor (ASIP) for a field-programmable gate array (FPGA), which used a truly 8-bit data path. Although the design in [7] was the smallest reported design for the AES on an FPGA, it still required 13546 cycles to perform the encryption (including key expansion), which can be too many for some of the applications mentioned above. In this paper, an area and power efficient architecture is presented for the implementation of the AES using an 8-bit data path split between key and state processing, with resource sharing of the SubBytes operation, which yields significant improvements over the previous best-case designs.

The work in this paper is divided into two major parts. The first part studies different S-Box implementation methods available in the literature and proposes two new methods for realizing the S-Box that reduce area and power consumption. The second part concentrates on an architecture that uses a split 8-bit data path between key and state processing with resource sharing of the SubBytes operation, a low-resource circuit for MixColumns, and an optimized controller implementation.

The remainder of this paper is organized as follows. Section II briefly describes the AES, followed by different methods for implementing the S-Box in Section III. Details of the AES core architecture are given in Section IV. The synthesis results of the design are given in Section V. The paper ends by drawing some conclusions in Section VI.

II. AES ALGORITHM

The AES algorithm is a subset of a much larger encryption algorithm known as Rijndael. NIST announced the
Rijndael algorithm as the winner due to the best overall score in security, performance, efficiency, implementation capability, and simplicity. The AES algorithm is a symmetric-key iterative block cipher and operates on a data block of 128 bits. The key length can be 128, 192, or 256 bits depending on the desired level of security. The algorithm repeats for Nr rounds, where Nr depends on the key length and is 10, 12, or 14 for a key length of 128, 192, or 256 bits, respectively. The 128-bit data block is divided into 16 bytes and is viewed as a rectangular array of bytes with four rows. The number of columns is denoted by Nb and is equal to the block length divided by 32. The cipher key is similarly viewed as a rectangular array with four rows. The number of columns of the cipher key is denoted by Nk and is equal to the key length divided by 32. Both forward (encryption) and inverse (decryption) operations can be performed using the AES algorithm.

AES supports different modes of operation. Both encryption and decryption operations are required to implement the electronic code book (ECB) mode for the cipher. However, other modes require only the AES encryption operation in order to provide both encryption and decryption. Such modes include counter (CTR), output feedback (OFB), and cipher feedback (CFB). These modes are approved by NIST [4] for use with the AES. A more recent authenticated mode, counter with CBC-MAC (CCM), may also be supported using only the encryption primitive.

In the AES encryption process, following an initial AddRoundKey step, each round except the final round consists of four transformations: SubBytes, ShiftRows, MixColumns, and AddRoundKey. The final round has only the SubBytes, ShiftRows, and AddRoundKey transformations. The operators are conveniently defined by considering the state as a 4×4 matrix of 8-bit values. The ShiftRows operator rotates each row of the State to the left by a specific offset. The offset equals the row index (starting at 0), which means that the first row is not rotated at all and the last row is rotated by three bytes to the left. The SubBytes operation is a non-linear byte substitution that operates independently on each byte of the State using a substitution table (S-box). The MixColumns operator performs a set of fixed-value GF(2^8) multiplications and operates on the columns of the state. The final operation, AddRoundKey, is simply the bitwise XOR of the state and the RoundKey. The KeyExpansion uses four SubBytes operations followed by some GF additions to yield the set of RoundKeys. The expansion operation also incorporates a byte-wise rotation and the addition of a round-specific constant Rcon. These constants can be derived using finite-field doubling.

III. DIFFERENT IMPLEMENTATIONS FOR S-BOX

The S-box is a costly and performance-critical building block of the AES algorithm. In addition, the S-box also impacts the area and power consumption of AES hardware. Therefore, the AES S-box has been a subject of intensive research in recent years, which has led to a rich literature on efficient S-box design and implementation. The literature can be roughly categorized into S-boxes that contain optimized circuits for arithmetic in GF(2^8) [5], constructions using hardware look-up tables, and dedicated low-power solutions [2], all of which have their specific advantages and disadvantages with respect to area, delay, and power consumption. In this work, S-Boxes that fall into the above three categories were examined, and two new methods were proposed that reduce area and power consumption relative to the existing methods.

The simplest design in our comparison is a straightforward implementation of a hardware look-up table (LUT). Implementing the S-box directly as a LUT, using either a ROM or letting the synthesis tool build combinational logic from the truth table, does not result in the optimal solution in terms of gate area in ASIC technologies, although this approach can be used efficiently in FPGAs. To understand the complexity of this type of S-Box in an ASIC, it was implemented using Verilog HDL. This approach will be denoted as HW-LUT.

The second method implements the S-Box using combinational logic. Here the S-box is constructed by first finding the multiplicative inverse (MI) of the input byte in GF(2^8) with respect to the polynomial p(x) = x^8 + x^4 + x^3 + x + 1, and then applying an affine transformation. The affine transformation involves multiplication with a matrix and adding (exclusive-ORing) the constant 63H. Since the computation of the MI in GF(2^8) is a hardware-intensive operation, it is done by decomposing the more complex GF(2^8) into the lower-order fields GF(2), GF(2^2), and GF((2^2)^2) using the irreducible polynomials given in [5].

Any element in GF(2^8) may be represented as bx + c given an irreducible polynomial x^2 + Ax + B, where b is the most significant nibble and c is the least significant nibble. From here, the multiplicative inverse can be computed using the equation given below.

\[ (bx + c)^{-1} = b\,(b^{2}B + bcA + c^{2})^{-1}\,x + (c + bA)\,(b^{2}B + bcA + c^{2})^{-1} \tag{1} \]

From [5], the irreducible polynomial selected was x^2 + x + λ. Since A = 1 and B = λ, the equation simplifies to the form shown below.

\[ (bx + c)^{-1} = b\,(b^{2}\lambda + c(b + c))^{-1}\,x + (c + b)\,(b^{2}\lambda + c(b + c))^{-1} \tag{2} \]

The above equation involves multiplication, addition, squaring, and multiplicative inversion operations in GF(2^4). Complete details on how to implement squaring, multiplication, addition, and multiplicative inversion in GF(2^4) can be found in [5].

The third method for implementing the S-Box, known as hybrid-LUT, uses a LUT for the inversion in GF(2^4) and implements the rest of the circuit using combinational logic. This is the first proposed method; according to Table I it requires less area than the first two methods and less power than the purely combinational S-Box.

The fourth method, the Decoder-Switch-Encoder (DSE) S-Box, was proposed by Bertoni et al. as a low-power solution [2]. As shown in Fig. 1, the permutation block takes 256 one-hot coded decoder outputs and connects them to the inputs of
encoder in a way that achieves the required S-Box functionality. Since only the single active line changes position within the string of 256 bits, the permutation function consumes virtually no power. In order to reduce the total dynamic power consumption, Bertoni et al. proposed a 3-stage decoder, but for the encoder they used the structure given by the synthesis tool.

Figure 1. DSE S-Box (8-bit input, DECODER, 256 lines, PERMUTATION BLOCK, 256 lines, ENCODER, 8-bit output).

The second proposed method is an improvement to the DSE S-Box that further reduces its area and power consumption. Instead of using the encoder structure given by the synthesis tool, a 4-stage encoder was developed that has balanced signal paths to eliminate the dynamic hazards that cause unnecessary power consumption. Chip area is also reduced by maximum reuse of the gates. This resulted in reduced power consumption and area compared with the Bertoni implementation.

Figure 2. 4-Stage encoder structure for DSE S-Box

Synthesis results for all the above S-Boxes are given in Table I. The reported area is the number of gate equivalents (GE), i.e., the total area divided by the area of a NAND gate in the technology used.

IV. IMPLEMENTATION OF AES CORE ARCHITECTURE

The aim of this design is to minimize the power-area-latency triple. This minimization is achieved by appropriate resource sharing, a simple compact memory architecture, a truly 8-bit data path width, and controller optimization.

Fig. 4 shows the overall circuit for the design. For the first 16 clock cycles, the key is fed into the key memory. For the next 16 cycles, as the plaintext is supplied, the first AES round is processed (simply AddRoundKey), and the results are stored in the data memory. The middle round processing proceeds in column order. The final round is similar to the middle rounds except that MixColumns is bypassed and the result bytes are stored in the output register. The minimum memory requirement for the AES to store both the working state and the current RoundKey is 2×128 bits. Two 16-byte memories were used, one for the RoundKey and one for the state. The increased flow of operands allows most of the independent operations on key and state to occur in parallel. A 4-byte (8 bits wide) shift register is required to retain the ability to use single-port memories and to avoid stalling the processing while operands are refetched for MixColumns. The ShiftRows operation is performed implicitly by the order in which bytes are addressed to form the required column for MixColumns, after which the column is written back to the same locations, and the order is compensated for by the addressing scheme in each subsequent round. In this design a single S-Box is shared between the KeyExpansion and Encryption modules. KeyExpansion is performed on the fly and is interleaved with the state processing, and thus needs no additional cycles. The controller is implemented as a finite-state machine consisting of only 15 states and is supported by three 4-bit counters. A total of 352 cycles are taken for encryption, including both key and data I/O.

Figure 3. Low resource circuit for MixColumns, including final round bypass.

Figure 4. Block diagram of AES core architecture
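The idea of performing ShiftRows through the read addressing alone, as described in Section IV, can be illustrated with a small sketch. It assumes a column-major state memory (address = 4 × column + row); the helper names are hypothetical, and the write-back compensation that the design applies in subsequent rounds is not modelled here.

```python
# Illustrative sketch only: the helper names and the column-major layout
# (address = 4*column + row) are assumptions, not taken from the paper.

def shifted_column_addresses(col):
    """Addresses of the four bytes that form column `col` after ShiftRows."""
    # Row r is rotated left by r, so its byte comes from column (col + r) mod 4.
    return [4 * ((col + row) % 4) + row for row in range(4)]


def shift_rows_reference(state):
    """Textbook ShiftRows on a 16-byte column-major state, for comparison."""
    out = [0] * 16
    for row in range(4):
        row_bytes = [state[4 * col + row] for col in range(4)]
        rotated = row_bytes[row:] + row_bytes[:row]  # rotate left by `row`
        for col in range(4):
            out[4 * col + row] = rotated[col]
    return out


state = list(range(16))  # each byte value equals its address, for readability
gathered = [state[a] for col in range(4) for a in shifted_column_addresses(col)]
assert gathered == shift_rows_reference(state)
print(shifted_column_addresses(0))  # [0, 5, 10, 15]
```

Reading the state memory in this address order delivers each column to MixColumns already shifted, so no separate ShiftRows hardware or cycles are needed in this simplified single-round view.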

TABLE I. SYNTHESIS RESULTS OF DIFFERENT S-BOXES

S.No  S-Box implementation           Total power (µW)  Area (GE)  Max delay (ns)
1     HW-LUT                         1.98              669        2.88
2     Combinational logic (CL)       35.87             329        4.96
3     Mix of CL & LUT                31.33             285        5.51
4     Decoder Switch Encoder (DSE)   1.84              632        2.71
5     Improved DSE                   1.55              596        2.49

At the 8-bit level, MixColumns is also challenging, as it is mathematically equivalent to a 32-bit operation. The MixColumns transformation operates on the state column by column, treating each column as a four-term polynomial. The columns are considered as polynomials over GF(2^8) and multiplied modulo x^4 + 1 with a fixed polynomial a(x) given by a(x) = {03}x^3 + {01}x^2 + {01}x + {02}. The MixColumns operation can be expressed in matrix form as shown below.

\[ \begin{bmatrix} s'_{0,c} \\ s'_{1,c} \\ s'_{2,c} \\ s'_{3,c} \end{bmatrix} = \begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix} \begin{bmatrix} s_{0,c} \\ s_{1,c} \\ s_{2,c} \\ s_{3,c} \end{bmatrix}, \quad 0 \le c < Nb \tag{3} \]

This may be simplified to a series of finite-field doubling (f2), tripling (f3), and addition (XOR) operations. As shown in [7], this can be done using a sequence of 8-bit operations; however, this requires 12 cycles for each 32-bit MixColumns operation. A compromise [6] is to use a shift register supplied with 8-bit data, perform a 32-bit in, 8-bit out version of MixColumns, and cycle the data to yield the 32-bit operator. This approach requires only seven cycles for each 32-bit operation and integrates more efficiently with the SubBytes and ShiftRows operations. In addition to the arithmetic, a further optimization using smaller inverting gates has been made to bypass the MixColumns operation in the final round. The final circuit is shown in Fig. 3 and requires a total area of 315 GE.

V. EXPERIMENTAL RESULTS

The AES architecture is described in Verilog HDL at the register-transfer level. The RTL description was synthesized to the gate level using a 0.18 µm, 1.8 V, standard-cell CMOS technology. The Synopsys VCS tool was used for functional and timing simulations, and Design Compiler was used to synthesize the Verilog description into a technology-mapped netlist and for static timing analysis (STA). All the results in Table II are presented under typical operating conditions (1.8 V and 25 °C). The power was measured from the switching activity of the nodes in the gate-level circuit, with typical test vectors as stimulus, using Synopsys PrimePower. The power-area-latency product is used as the performance metric to compare different implementations. The results for a 10 MHz clock frequency and a comparison with previous state-of-the-art AES designs are presented in Table II. A number of high-throughput designs have been included in this table to illustrate their inapplicability to a low-resource environment where power, area, and latency are all important. The power-area-latency product of the implemented design, with either of the two proposed S-Boxes, is considerably lower than that of all the previous implementations except [3]. The higher power consumption of this design compared with [3] is attributed to the large difference in core voltage (1.8 V versus 0.8 V); this design is therefore considered more efficient than [3] once the voltage difference is taken into account.

The operation speed of the AES can be determined by the data throughput, expressed by (4). The maximum frequency of this design is 110 MHz and the operation cycle count or latency (N-cycle) is 352, hence the maximum throughput Tthroughput is 40 Mb/s.
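As a quick check of the throughput figure, the calculation can be sketched in a few lines. The sketch assumes that (4) has the usual form for an iterative block cipher, i.e. block size times clock frequency divided by cycles per block; this form is an assumption that reproduces the 40 Mb/s quoted above, and the function name is hypothetical.

```python
# Minimal sketch, assuming throughput = block_bits * f_clk / n_cycles.
# The function name and argument names are illustrative, not from the paper.

def throughput_bps(f_clk_hz, n_cycles, block_bits=128):
    """Throughput in bits per second for one block every `n_cycles` clock cycles."""
    return block_bits * f_clk_hz / n_cycles


# 352 cycles per 128-bit block, as reported for this design.
print(f"{throughput_bps(110e6, 352) / 1e6:.2f} Mb/s at 110 MHz")  # 40.00 Mb/s
print(f"{throughput_bps(10e6, 352) / 1e6:.2f} Mb/s at 10 MHz")    # 3.64 Mb/s
```

Under the same assumption, the 10 MHz operating point used for the power results corresponds to roughly 3.6 Mb/s.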
TABLE II. COMPARISON OF VARIOUS AES IMPLEMENTATIONS

Design                                    Type   Mode   Tech (µm)  Core (V)  Power (µW)  Area (kgates)  Latency (cycles/clock freq)       Efficiency, P-A-T (µJ-gates)
Kuo 2003 [10]                             Chip   n/k a  0.18       1.8       56,000      173            12 cycles/154 MHz = 77.9 ns       56 mW × 173K × 77.9 ns = 754.28
Feldhofer 2004 [6]                        Chip   ECB    0.35       1.5       4.5         4.4            1032 cycles/100 kHz = 10.32 ms    4.5 µW × 4.4K × 10.32 ms = 204.34
Hsiao 2006 [11]                           Synth  ECB    0.18       n/k       34,000      15             10 cycles/104 MHz = 96.2 ns       34 mW × 15K × 96.2 ns = 49.07
Kaps 2006 [9]                             Synth  CBC    0.13       1.2       20.23       4.1            534 cycles/500 kHz = 1.07 ms      20.23 µW × 4.1K × 1.07 ms = 88.56
Lin 2007 [12]                             Synth  many   0.13       1.2       40,900      86.2           10 cycles/333 MHz = 30.0 ns       40.9 mW × 86.2K × 30 ns = 106.02
Tim Good 2010 [3]                         Chip   Enc    0.13       0.8       99          5.5            356 cycles/12 MHz = 29.67 µs      99 µW × 5.5K × 29.67 µs = 16.17
This work (with mix of CL & LUT S-Box)    Synth  Enc    0.18       1.8       174         4.3            352 cycles/10 MHz = 35.2 µs       174 µW × 4.3K × 35.2 µs = 26.34
This work (with improved DSE S-Box)       Synth  Enc    0.18       1.8       143         4.59           352 cycles/10 MHz = 35.2 µs       143 µW × 4.59K × 35.2 µs = 23.10

a. Supports 128, 192 and 256 bit keys
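The efficiency column of Table II is the product of power, area, and latency. The following sketch recomputes the two "This work" rows from the tabulated values; the function name and the unit conventions (watts, gates, seconds in; µJ-gates out) are assumptions made for illustration.

```python
# Minimal sketch of the power-area-latency (P-A-T) product used in Table II.
# Function and parameter names are illustrative, not from the paper.

def pat_uj_gates(power_w, area_gates, cycles, f_clk_hz):
    """P-A-T product in µJ-gates: power (W) * area (gates) * latency (s) * 1e6."""
    latency_s = cycles / f_clk_hz
    return power_w * area_gates * latency_s * 1e6


# Values taken from the two "This work" rows of Table II (10 MHz, 352 cycles).
hybrid = pat_uj_gates(174e-6, 4300, 352, 10e6)
improved_dse = pat_uj_gates(143e-6, 4590, 352, 10e6)
print(f"mix of CL & LUT S-Box: {hybrid:.2f} µJ-gates")        # 26.34
print(f"improved DSE S-Box:    {improved_dse:.2f} µJ-gates")  # 23.10
```

Because latency enters the product directly, a design that is fast but large and power-hungry can still score poorly, which is why the high-throughput entries in the table have the largest P-A-T values.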

\[ T_{throughput} = \frac{128 \times f_{clk}}{N_{cycle}} \tag{4} \]

The throughput of this design at a 10 MHz clock frequency is 3.6 Mb/s, a data rate that is more than sufficient for resource-critical applications. Finally, the majority of the power and area gains in this design have been made by using a more efficient generic cycle-by-cycle design for the AES, together with improved memory and MixColumns architectures for an 8-bit data path.

VI. CONCLUSION

This paper presents an optimized ASIC implementation of the AES for resource-critical applications. Five AES S-box implementations, which follow three different design strategies, were analyzed and compared in terms of critical path delay, silicon area, and power consumption, based on synthesis runs with a 0.18 µm CMOS standard-cell library. In addition, two new methods were proposed for implementing the S-Box with reduced area and power consumption. In this design, throughput is sacrificed in favor of reducing power and area. In comparison with the other designs in Table II, it shows the best P-A-T efficiency, 23.10 µJ-gates, with the exception of [3], which operates at a much lower core voltage. The design decision was to incorporate only AES encryption, as this is the minimum requirement for a number of useful modes (OFB, CTR, CFB, and CCM), all of which can provide data encryption and decryption using only an encryption primitive.

REFERENCES

[1] National Institute of Standards and Technology (NIST), "Advanced Encryption Standard," Federal Information Processing Standards (FIPS) Publication 197, Nov. 2001.
[2] G. Bertoni, M. Macchetti, L. Negri, and P. Fragneto, "Power-efficient ASIC synthesis of cryptographic S-boxes," in Proc. 14th ACM Great Lakes Symposium on VLSI (GLSVLSI 2004), ACM Press, 2004, pp. 277-281.
[3] T. Good and M. Benaissa, "692-nW Advanced Encryption Standard (AES) on a 0.13 µm CMOS," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 12, pp. 1753-1757, Dec. 2010.
[4] NIST, "Recommendation for block cipher modes of operation," Special Publication SP-800-38A, 2001. [Online]. Available: http://csrc.nist.gov/publications/PubsSPs.html
[5] E. N. C. Mui, "Practical implementation of Rijndael S-Box using combinational logic," 2004.
[6] M. Feldhofer, J. Wolkerstorfer, and V. Rijmen, "AES implementation on a grain of sand," Proc. Inst. Electr. Eng. Inf. Security, vol. 1, pp. 13-20, 2005.
[7] T. Good and M. Benaissa, "Very small FPGA application-specific instruction processor for AES," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 53, no. 7, pp. 1477-1486, Jul. 2006.
[8] X. Zhang and K. K. Parhi, "Implementation approaches for the AES algorithm," IEEE Circuits and Systems Magazine, vol. 2, pp. 1477-1486, 2002.
[9] J.-P. Kaps and B. Sunar, "Energy comparison of AES and SHA-1 for ubiquitous computing," in Proc. Embedded Ubiquitous Comput. (EUC), Seoul, Korea, Aug. 2006, pp. 372-381.
[10] H. Kuo, I. Verbauwhede, and P. Schaumont, "A 2.29 Gbits/sec, 56 mW non-pipelined Rijndael AES encryption IC in a 1.8 V 0.18 µm CMOS technology," in Proc. CICC, Orlando, FL, 2002, pp. 147-150.
[11] S.-F. Hsiao, M.-C. Chen, and C.-S. Tu, "Memory-free low-cost designs of advanced encryption standard using common subexpression elimination for subfunctions in transformations," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 53, no. 3, pp. 615-626, Mar. 2006.
[12] S.-Y. Lin and C.-T. Huang, "A high-throughput low-power AES cipher for network applications," in Proc. ASP-DAC, Yokohama, Japan, Jan. 2007, pp. 595-600.
[13] A. Satoh, S. Morioka, K. Takano, and S. Munetoh, "A compact Rijndael hardware architecture with S-box optimization," in Proc. ASIACRYPT, Gold Coast, Qld., Australia, Dec. 2001, vol. 2248, Lecture Notes in Computer Science, pp. 239-254.
