AES algorithm
K. Vidya Sagar, R. Thilagavathy
Dept. of Electronics & Communication,
National Institute of Technology,
Trichy, India-620015.
Mobile: 9042711438
e-mail: sagar123.k@gmail.com
Abstract: This paper presents a power- and area-efficient hardware implementation of the Advanced Encryption Standard (AES) algorithm. Different implementations of the AES S-box are evaluated with respect to timing (i.e., critical path), silicon area and power consumption in order to choose a suitable S-box. The architecture uses a split 8-bit data path between key and state processing, with resource sharing of the S-box operation to minimize area and power consumption. The AES core occupies an area of 4.59 kgates and consumes 143 µW of power at 10 MHz in 0.18 µm standard-cell CMOS technology.

Keywords: Advanced Encryption Standard (AES); Encryption; Substitute Byte (S-Box); MixColumns; Application Specific Integrated Circuit (ASIC)

I. INTRODUCTION

In today's digital world, encryption is becoming an integral part of all communication networks and information processing systems, protecting both stored and in-transit data. Encryption is the transformation of plain data (known as plaintext) into unintelligible data (known as ciphertext) through an algorithm referred to as a cipher. Numerous encryption algorithms are in common use, but the National Institute of Standards and Technology (NIST) selected Rijndael and named it the Advanced Encryption Standard (AES) on 26 November 2001, when the Data Encryption Standard (DES) was no longer secure enough for protecting sensitive information.

The AES algorithm is extremely flexible and may be implemented using different hardware architectures, from deeply pipelined loop-unrolled implementations that support many gigabits-per-second throughputs, through iterative single-round implementations, to resource-shared designs, which sacrifice throughput in favour of saving power and area. There is a continued demand for better hardware implementations, and a number of application areas seek even lower resource designs for block ciphers such as the AES, requiring only modest data rates of less than 1 Mbps. Inductively powered RF identification (RFID), wireless sensor networks (WSNs), and smart cards are some of these resource-critical applications: their clock frequencies, derived from the energizing RF, can be as low as 100 kHz, with power limitations of a few microwatts, latency requirements of a few milliseconds, and an area limitation of a few thousand gates. In this paper, a hardware implementation of the AES algorithm is presented with the goal of reducing area and power consumption for use in such resource-critical applications.

Designs aiming for high throughput, typically based on 128-bit data paths, have been widely reported, and they provide highly efficient solutions for gigabits-per-second throughputs. Many of the attempts at application-specific integrated circuit (ASIC) designs for the AES considered only a 32-bit data path as the minimum. A good example is the work of Satoh et al. [13]. However, in the interests of lower power, an 8-bit data path has been explored more recently. In a previous design [7], the authors explored the use of an application-specific instruction processor (ASIP) for a field-programmable gate array (FPGA), which used a truly 8-bit data path. Although the design in [7] was the smallest reported design for the AES on an FPGA, it still required 13546 cycles to perform the encryption (including key expansion), which can be too many for some of the applications cited above. In this paper, an area- and power-efficient architecture is presented for the implementation of the AES using an 8-bit data path between key and state processing, with resource sharing of the SubBytes operation, which yields significant improvements over the previous best-case designs.

The work in this paper can be divided into two major parts. The first part studies the different S-box implementation methods available in the literature and proposes two new methods for realizing the S-box that minimize area and power consumption. The second part concentrates on selecting an architecture that uses a split 8-bit data path between key and state processing with resource sharing of the SubBytes operation, and on implementing a low-resource circuit for MixColumns and an optimized controller.

The remainder of this paper is organized as follows. Section II briefly describes the AES, followed by different methods for implementing the S-box in Section III. Details of the implementation of the AES core architecture are given in Section IV. The synthesis results of the design are given in Section V. The paper ends by drawing some conclusions in Section VI.

II. AES ALGORITHM

The AES algorithm is a subset of a much larger encryption algorithm known as Rijndael. NIST announced the
Rijndael algorithm as the winner because it had the best overall score in security, performance, efficiency, implementation capability and simplicity. The AES algorithm is a symmetric-key iterative block cipher that operates on a data block of 128 bits. The key length can be 128, 192, or 256 bits, depending on the desired level of security. The algorithm repeats for Nr rounds, where Nr depends on the key length and is 10, 12, or 14 for a key length of 128, 192, or 256 bits, respectively. The 128-bit data block is divided into 16 bytes and viewed as a rectangular array of bytes with four rows. The number of columns is denoted by Nb and is equal to the block length divided by 32. The cipher key is similarly viewed as a rectangular array with four rows; its number of columns is denoted by Nk and is equal to the key length divided by 32. Both forward (encryption) and inverse (decryption) operations can be performed using the AES algorithm.

AES supports different modes of operation. Both encryption and decryption operations are required to implement the electronic code book (ECB) mode for the cipher. However, other modes require only the AES encryption operation in order to provide both encryption and decryption. Such modes include counter (CTR), output feedback (OFB), and cipher feedback (CFB). These modes are approved by NIST [4] for use with the AES. A more recent authenticated mode, counter with CBC-MAC (CCM), may also be supported using only the encryption primitive.

In the AES encryption process, following an initial AddRoundKey step, each round except the final round consists of four transformations: SubBytes, ShiftRows, MixColumns and AddRoundKey. The final round has only the SubBytes, ShiftRows and AddRoundKey transformations. Considering the state as a 4×4 matrix of 8-bit values, the operators may be conveniently defined. The ShiftRows operator rotates each row of the state to the left by a specific offset. The offset equals the row index (starting at 0), which means that the first row is not rotated at all and the last row is rotated by three bytes to the left. The SubBytes operation is a non-linear byte substitution that operates independently on each byte of the state using a substitution table (S-box). The MixColumns operator performs a set of fixed-value $GF(2^8)$ multiplications and operates on the columns of the state. The final operation, AddRoundKey, is simply the bitwise XOR of the state and the RoundKey. The KeyExpansion uses four SubBytes operations followed by some GF additions to yield the set of RoundKeys. The expansion operation also incorporates a byte-wise rotation and the addition of a round-specific constant Rcon. These constants can be derived using finite-field doubling.

III. DIFFERENT IMPLEMENTATIONS FOR S-BOX

The S-box is a costly and performance-critical building block of the AES algorithm. In addition, the S-box also affects the area and power consumption of AES hardware. Therefore, the AES S-box has been a subject of intensive research in recent years, which has led to a rich literature on efficient S-box design and implementation. The literature can be roughly categorized into S-boxes that contain optimized circuits for arithmetic in $GF(2^8)$ [5], constructions using hardware look-up tables, and dedicated low-power solutions [2], all of which have their specific advantages and disadvantages with respect to area, delay, and power consumption. In this work, S-boxes that fall into the above three categories were examined, and two new methods are proposed that consume less area and power than the existing methods.

The simplest design in our comparison is a straightforward implementation of a hardware look-up table (LUT). Implementing the S-box directly as a LUT, using either a ROM or letting the synthesis tool build combinational logic from the truth table, does not result in the optimal solution in terms of gate area in ASIC technologies, although this approach can be used efficiently in FPGAs. To understand the complexity of this type of S-box in an ASIC, it was implemented in Verilog HDL. This approach will be denoted as HW-LUT.

The second method of implementing the S-box uses combinational logic. Here the S-box is constructed by first finding the multiplicative inverse (MI) of the input byte in $GF(2^8)$ with respect to the polynomial $p(x) = x^8 + x^4 + x^3 + x + 1$, and then applying an affine transformation. The affine transformation involves multiplication by a matrix and addition (exclusive-OR) of the constant 63H. Since the computation of the MI in $GF(2^8)$ is a hardware-intensive operation, it is done by decomposing the more complex $GF(2^8)$ into the lower-order fields $GF(2^1)$, $GF(2^2)$ and $GF((2^2)^2)$, using the irreducible polynomials given in [5].

Any element in $GF(2^8)$ may be represented as $bx + c$, given an irreducible polynomial $x^2 + Ax + B$, where $b$ is the most significant nibble and $c$ is the least significant nibble. From here, the multiplicative inverse can be computed using the equation given below.

$$
(bx + c)^{-1} = b\,(b^2 B + bcA + c^2)^{-1}\,x + (c + bA)\,(b^2 B + bcA + c^2)^{-1} \tag{1}
$$

From [5], the irreducible polynomial selected was $x^2 + x + \lambda$. Since $A = 1$ and $B = \lambda$, the equation can be simplified to the form shown below.

$$
(bx + c)^{-1} = b\,(b^2 \lambda + c(b + c))^{-1}\,x + (c + b)\,(b^2 \lambda + c(b + c))^{-1} \tag{2}
$$

The above equation indicates that multiplication, addition, squaring and multiplicative inversion operations in $GF(2^4)$ are required. Complete details on how to implement squaring, multiplication, addition and multiplicative inversion in $GF(2^4)$ can be found in [5].

The third method for implementing the S-box is known as hybrid-LUT. It uses the LUT approach for implementing the inversion in $GF(2^4)$, and the rest of the circuit is implemented using combinational logic. This proposed method resulted in lower area and power consumption than the first two methods.

The fourth method, the Decoder-Switch-Encoder (DSE) S-box, was proposed by Bertoni to achieve a low-power solution [2]. As shown in Fig. 1, the permutation block takes 256 one-hot coded decoder outputs and connects them to the inputs of
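As a software cross-check of the combinational-logic construction described in this section (multiplicative inverse in GF(2^8) modulo p(x) = x^8 + x^4 + x^3 + x + 1, followed by the affine transformation with constant 63H), the following minimal Python sketch computes S-box entries directly. It is not the hardware design: the inverse is obtained by repeated multiplication rather than by the composite-field decomposition of [5], and the helper names gf_mul, gf_inv, rotl8 and sbox are hypothetical, chosen for illustration only.

```python
# Illustrative software model of the AES S-box; not the paper's circuit.

def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a          # add (XOR) the current multiple of a
        carry = a & 0x80
        a = (a << 1) & 0xFF      # multiply a by x
        if carry:
            a ^= 0x1B            # reduce modulo the field polynomial
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse in GF(2^8), computed as a^254; 0 maps to 0."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x: int, n: int) -> int:
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox(x: int) -> int:
    """Inverse in GF(2^8) followed by the affine transformation (constant 63H)."""
    inv = gf_inv(x)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63
```

For example, sbox(0x00) evaluates to 0x63 and sbox(0x53) to 0xED, matching the published AES substitution table.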
IV. IMPLEMENTATION OF AES CORE ARCHITECTURE

| S.No | S-Box implementation | Total power (µW) | Area (GE) | Max delay (ns) |
|---|---|---|---|---|
| 1 | HW-LUT | 1.98 | 669 | 2.88 |
| 2 | Combinational logic (CL) | 35.87 | 329 | 4.96 |
| 3 | Mix of CL & LUT | 31.33 | 285 | 5.51 |
| 4 | Decoder Switch Encoder (DSE) | 1.84 | 632 | 2.71 |
| 5 | Improved DSE | 1.55 | 596 | 2.49 |

At the 8-bit level, MixColumns is also challenging, as it is mathematically equivalent to a 32-bit operation. The MixColumns transformation operates on the state column by column, treating each column as a four-term polynomial. The columns are considered as polynomials over $GF(2^8)$ and multiplied modulo $x^4 + 1$ with a fixed polynomial $a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}$. The MixColumns operation can be expressed in matrix form as shown below.

$$
\begin{bmatrix} s'_{0,c} \\ s'_{1,c} \\ s'_{2,c} \\ s'_{3,c} \end{bmatrix}
=
\begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix}
\begin{bmatrix} s_{0,c} \\ s_{1,c} \\ s_{2,c} \\ s_{3,c} \end{bmatrix}
\quad \text{for } 0 \le c < Nb \tag{3}
$$

This may be simplified to a series of finite-field doubling ($f_2$), tripling ($f_3$), and addition (XOR) operations. As shown in [7], this can be done using a sequence of 8-bit operations; however, this requires 12 cycles for each 32-bit MixColumns operation. A compromise [6] is to use a shift register supplied with 8-bit data, perform a 32-bit in, 8-bit out version of MixColumns, and cycle the data to yield the 32-bit operator. This approach requires only seven cycles for each 32-bit operation, and it is more efficiently integrated with the SubBytes and ShiftRows operations. In addition to the arithmetic, a further optimization using smaller inverting gates has been made to bypass the MixColumns operation in the final round. The final circuit, shown in Fig. 3, requires a total area of 315 GE.

V. EXPERIMENTAL RESULTS

The AES architecture is described in Verilog HDL at the register-transfer level (RTL). The RTL description was synthesized to the gate level using a 0.18 µm, 1.8 V, standard-cell CMOS technology. The Synopsys VCS tool was used for functional and timing simulations, and Design Compiler was used to synthesize the Verilog description of the design into a technology-mapped netlist and for static timing analysis (STA). All the results in Table II are presented under typical operating conditions (1.8 V and 25 ºC). The power was measured from the node switching activity in the gate-level circuit, with typical test vectors as stimulus, using Synopsys PrimePower.

Here the power-area-latency product is used as the performance metric to compare different implementations. The results for a 10 MHz clock frequency and a comparison with previous state-of-the-art AES designs are presented in Table II. A number of high-throughput designs have been included in this table to illustrate their inapplicability to a low-resource environment where power, area, and latency are all important. The power-area-latency product of the implemented design, using either of the two proposed S-boxes, is considerably lower than that of all the previous implementations except [3]. The higher power consumption of this design compared with [3] is due to the large difference in core voltage (1.8 V here versus 0.8 V in [3]), so at a comparable core voltage this design should be more efficient than [3].

The operating speed of the AES can be characterized by the data throughput, expressed by (4). The maximum frequency of this design is 110 MHz and the operation cycle count or latency ($N_{cycle}$) is 352, hence the maximum throughput $T_{throughput}$ is 40 Mb/s.
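Assuming (4) is the usual definition of block-cipher throughput (block size multiplied by clock frequency, divided by the number of cycles per block), the quoted figures are consistent with one another, as the short Python check below illustrates. The function name throughput_bps and the 128-bit default block size are assumptions made for this sketch.

```python
def throughput_bps(f_clk_hz: float, n_cycles: int, block_bits: int = 128) -> float:
    """Throughput in bits per second: one block every n_cycles clock cycles."""
    return block_bits * f_clk_hz / n_cycles


# Maximum clock frequency and latency quoted in the text.
max_tp = throughput_bps(110e6, 352)

# Same latency at the 10 MHz clock used for the power figures.
tp_10mhz = throughput_bps(10e6, 352)

print(f"{max_tp / 1e6:.2f} Mb/s at 110 MHz")
print(f"{tp_10mhz / 1e6:.2f} Mb/s at 10 MHz")
```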
TABLE II. COMPARISON OF VARIOUS AES IMPLEMENTATIONS
| Design | Type | Mode | Tech (µm) | Core (V) | Power (µW) | Area (kgates) | Latency (cycles / clock freq) | Efficiency, P-A-T (µJ-gates) |
|---|---|---|---|---|---|---|---|---|
| Kuo 2003 [10] | Chip | n/k (a) | 0.18 | 1.8 | 56,000 | 173 | 12 cycles / 154 MHz = 77.9 ns | 56 mW * 173K * 77.9 ns = 754.28 |
| Feldhofer 2004 [6] | Chip | ECB | 0.35 | 1.5 | 4.5 | 4.4 | 1032 cycles / 100 kHz = 10.32 ms | 4.5 µW * 4.4K * 10.32 ms = 204.34 |
| Hsiao 2006 [11] | Synth | ECB | 0.18 | n/k | 34,000 | 15 | 10 cycles / 104 MHz = 96.2 ns | 34 mW * 15K * 96.2 ns = 49.07 |
| Kaps 2006 [9] | Synth | CBC | 0.13 | 1.2 | 20.23 | 4.1 | 534 cycles / 500 kHz = 1.07 ms | 20.23 µW * 4.1K * 1.07 ms = 88.56 |
| Lin 2007 [12] | Synth | many | 0.13 | 1.2 | 40,900 | 86.2 | 10 cycles / 333 MHz = 30.0 ns | 40.9 mW * 86.2K * 30 ns = 106.02 |
| Tim Good 2010 [3] | Chip | Enc | 0.13 | 0.8 | 99 | 5.5 | 356 cycles / 12 MHz = 29.67 µs | 99 µW * 5.5K * 29.67 µs = 16.17 |
| This work (with mix of CL & LUT S-Box) | Synth | Enc | 0.18 | 1.8 | 174 | 4.3 | 352 cycles / 10 MHz = 35.2 µs | 174 µW * 4.3K * 35.2 µs = 26.34 |
| This work (with improved DSE S-Box) | Synth | Enc | 0.18 | 1.8 | 143 | 4.59 | 352 cycles / 10 MHz = 35.2 µs | 143 µW * 4.59K * 35.2 µs = 23.10 |

a. Supports 128, 192 and 256 bit keys
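The efficiency column of Table II is the product of power, area and latency. A minimal Python sketch of that arithmetic, using the values from the two "This work" rows, is given below. The function name pat_uj_gates is hypothetical, and area is taken in gates (kgates multiplied by 1000).

```python
def pat_uj_gates(power_w: float, area_gates: float, cycles: int, f_clk_hz: float) -> float:
    """Power-area-latency product in microjoule-gates."""
    latency_s = cycles / f_clk_hz                    # time to process one block
    return power_w * area_gates * latency_s * 1e6    # J-gates -> uJ-gates


# "This work" rows of Table II: 352 cycles at a 10 MHz clock.
hybrid_lut = pat_uj_gates(174e-6, 4300, 352, 10e6)    # mix of CL & LUT S-box
improved_dse = pat_uj_gates(143e-6, 4590, 352, 10e6)  # improved DSE S-box

print(f"{hybrid_lut:.2f} uJ-gates, {improved_dse:.2f} uJ-gates")
```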