1

2008 Annual Computer Security Applications Conference
Instruction Set Extensions for Enhancing the Performance of Symmetric-Key

Cryptography
Sean O’Melia ∗, Member, IEEE, and AJ Elbirt †, Member, IEEE
Abstract session key and to provide authenticity through digital sig-

natures. The session key is then used in conjunction with
Instruction set extensions for a RISC processor are pre- a symmetric-key algorithm. Symmetric-key algorithms tend
sented to improve the software performance of the Data En-
cryption Standard (DES), Triple-DES, the International Data to be significantly faster than public-key algorithms and as
Encryption Algorithm (IDEA), and the Advanced Encryption a result are typically used in bulk data encryption [37]. The
Standard (AES) algorithms. The most computationally in- two types of symmetric-key algorithms are block ciphers and
tensive operations of each algorithm are off-loaded to a set stream ciphers. Block ciphers operate on a block of data
of newly defined instructions. The additional hardware re- while stream ciphers encrypt individual bits or bytes. Block
quired to support these instructions is integrated into the
processor’s datapath. For each of the targeted algorithms, ciphers are typically used when performing bulk data encryp-
comparisons are presented between traditional software im- tion and the data transfer rate of the connection directly fol-
plementations and new implementations that take advantage lows the throughput of the implemented algorithm.
of the extended instruction set architecture. Results show that
utilization of the proposed instructions significantly reduces High throughput encryption and decryption are becoming
program code size and improves encryption and decryption increasingly important in the area of high-speed networking.
throughput. Moreover, the additional hardware resources re- Many applications demand the creation of networks that are
quired to support the instruction set extensions increases the
total area of the processor by less than 65%. both private and secure while using public data transmission
links. These systems, known as Virtual Private Networks
(VPNs), can demand encryption throughputs at speeds ex-
ceeding Asynchronous Transfer Mode (ATM) rates of 622
Keywords: symmetric-key, cryptography, software, FPGA
million bits per second (Mbps). Increasingly, security stan-
dards and applications are defined to be algorithm indepen-
1 Introduction dent. Although context switching between algorithms can
be easily realized via software implementations, the task is
With more than 188 million Americans connected to the significantly more difficult when using hardware implemen-
Internet [12], information security has become a top priority. tations. The advantages of a software implementation in-
Many applications — electronic mail, electronic banking, clude ease of use, ease of upgrade, ease of design, porta-
medical databases, and electronic commerce — require the bility, and flexibility. However, a software implementation
exchange of private information. For example, when engag- offers only limited physical security, especially with respect
ing in electronic commerce, customers provide credit card to key storage [37]. Conversely, cryptographic algorithms
numbers when purchasing products. If the connection is not that are implemented in hardware are by nature more physi-
secure, an attacker can easily obtain this sensitive data. In or- cally secure as they cannot easily be read or modified by an
der to implement a comprehensive security plan for a given outside attacker when the key is stored in special memory in-
network to guarantee the security of a connection, Confiden- ternal to the device. As a result, the attacker does not have
tiality, Data Integrity, Authentication, and Non-Repudiation easy access to the key storage area and cannot discover or
must be provided [37]. alter its value in a straightforward manner [37]. Although
Cryptographic algorithms used to ensure confidentiality traditional hardware implementations lack flexibility, config-
fall within one of two categories: private-key (also known as urable hardware devices offer a promising alternative for the
symmetric-key) and public-key. Symmetric-key algorithms implementation of processors via the use of IP cores in Ap-
use the same key for both encryption and decryption. Con- plication Specific Integrated Circuit (ASIC) and Field Pro-
versely, public-key algorithms use a public key for encryp- grammable Gate Array (FPGA) technology. To illustrate,
tion and a private key for decryption. In a typical session, Altera Corporation offers IP core implementations of the In-
a public-key algorithm will be used for the exchange of a tel 8051 microcontroller and the Motorola 68000 processor
∗ Sean O’Melia is a Associate Member of Technical Staff at MIT Lincoln
in addition to their own Nios R
-II embedded processor. Sim-
Laboratory. Email: sean.omelia@gmail.com
ilarly, Xilinx Inc. offers IP core implementations of the Pow-
† AJ Elbirt is a Senior Member of Technical Staff at The Charles Stark erPC processor in addition to their own MicroBlazeT M and
Draper Laboratory. Email: aelbirt@draper.com PicoBlazeT M embedded processors. ASIC and FPGA tech-
1063-9527/08 $25.00 © 2008 IEEE 465

DOI 10.1109/ACSAC.2008.10
nologies provide the opportunity to augment the existing dat- previous work on improving the performance of the AES
apath of a processor implemented via an IP core to add ac- algorithm on 32-bit systems, it has been shown that trans-
celeration modules supported through newly defined instruc- forming a block of plaintext from a column-oriented matrix
tion set extensions targeting performance-critical functions to a row-oriented matrix reduces the number of instructions
[13], [19], [39]. Instruction set extensions result in signifi- required to complete the cipher due to more efficient imple-
cant performance improvements versus traditional software mentation of the Galois Field matrix multiplication opera-
implementations with considerably reduced logic resource tions [1].
requirements versus hardware-only solutions [4], [23], [24], In order to extend the cryptographic capabilities of an
[32], [33]. embedded system without modifying the main processor, a
What follows is an overview on the various methods of co-processor solution can be adapted. When there is data
speeding up symmetric-key algorithms in software. It will that must be encrypted or decrypted through the chosen
be shown that advances in technology have fueled trends to- symmetric-key algorithm, the main processor sends the data
wards increased reconfigurability in embedded systems, re- and key material to the co-processor, and the co-processor
sulting in instruction set extensions becoming an attractive performs the algorithm, sending the processed data back over
option for performance-critical applications. A discussion of the interface to the main processor. Most co-processor so-
the target processor, the LEON2 RISC processor, will be fol- lutions have tended to combine a number of different al-
lowed by an analysis of the performance bottlenecks com- gorithms to provide a multi-faceted security solution. Co-
monly encountered in software implementations of the tar- processors have achieved high throughput values compared
geted cryptographic algorithms. The proposed instructions to traditional software implementations and therefore are
will then be presented, followed by a description of the mod- much more capable of meeting demands for speed-critical
ifications to the LEON2 processor and its associated devel- network communications. However, this type of solution re-
opment tools. Finally, data on the logic utilization of the quires considerable overhead in terms of hardware area, data
additional hardware as well as throughput data for the target transfer latency, and processor interfaces [4], [5], [11], [15],
algorithms achieved using the instruction set extensions will [16], [21], [28], [35], [36], [40].
be presented to demonstrate the effectiveness of this acceler- Previous work on instruction set extensions for general-
ation method. ized permutations are useful for improving the performance
of permutations used in the DES algorithm. Two new instruc-
2 Previous Work tions for general and dynamically specified permutations are
presented in [38]. The input and a string of configuration bits
Most traditional methods for improving the throughput of are specified in the source operands and the result is stored
pure software implementations of symmetric-key algorithms in the destination register. Permutations of n bits required
fall into one of two categories. One option is to construct log2 (n) issues of the custom instructions as well as several
memory-based look-up tables where results of some of the loads of configuration bits into registers. The MOSES plat-
basic operations of the algorithm have been pre-computed form, based on the Xtensa T1040, is a RISC-like processor
and stored. The substitution boxes, or S-Boxes, of the DES designed to be easily extended with additional hardware and
and AES algorithms are commonly stored in look-up tables supporting instructions. Throughput improvement factors of
in software implementations. Look-up tables may also be 31.0 for DES, 33.9 for Triple-DES, and 17.4 for AES were
used to combine operations used in the DES and AES al- reported for this architecture [31]. Similarly, instruction set
gorithms. AES requires several complicated mathematical extensions that perform the mathematical operations in the
operations that are time-consuming on general-purpose pro- AES rounds using custom functional units integrated into
cessors. Therefore in some implementations, large look-up the targeted processor’s datapath were investigated in [33].
tables, called T-tables, are employed that combine several These extensions minimize the number of memory accesses,
of these complex operations into a single table access [34]. usually by combining the SubBytes and MixColumns trans-
Look-up table based implementations are viable for systems formations into one T-table look-up operation to speed up al-
with large memory spaces and low access times. However, gorithm execution. While T-table performance is dependent
area-constrained systems suffer large performance penalties upon available cache size, these extensions have yielded per-
using this methodology and are typically not implemented in formance improvements of up to a factor of 3.68 versus AES
this manner [33], [34]. implementations without the use of the instruction set exten-
Another method for speeding up software implementa- sions [16], [18], [31], [36].
tions of cryptographic algorithms involves taking advantage Several software implementations of the IDEA algorithm
of mathematical or structural properties of the particular al- take advantage of advanced processor architectures that em-
gorithm. The Initial and Final Permutations of the DES algo- ploy instruction parallelism or functional units for multime-
rithm have regular structures that make it possible to execute dia support. A four-way parallel implementation on a 166
a series of matrix transformations and XOR operations as MHz Pentium MMX processor [25] achieved a throughput
demonstrated in [29]. This translates into a sequence of in- of approximately 72 Mbps. Throughput values ranging from
structions that is much smaller than the traditional sequence 421 Mbps to 550 Mbps have been achieved on the Itanium
required to perform the Initial and Final Permutations. In platform running at 733 MHz [14]. The performance evalu-
466
ations reported in [10] include a comparison of IDEA soft- of the LEON2 directory structure. Software code is placed in
ware implementations on processors with various word sizes, the /tsource/ sub-folder in a format readable by the test bench
clock frequencies, and cache sizes. Execution times for VHDL code. The software can then be read and executed by
IDEA encryption ranged from 2,555 µs on the 8-bit 4 MHz the test bench. In order to facilitate the development of pro-
Atmega 103 to 9 µs on the 64-bit 440 MHz UltraSparc2 R
grams targeting the LEON2 processor, Gaisler Research has
with instruction and data cache sizes of 16 Kbytes. Fast mul- provided a series of compilers and simulators that may be
tiplication capability was shown to be a major factor in the chosen depending on the software environment. For stand-
performance of the IDEA algorithm. alone applications, the Bare C Compiler, based on the GNU
Implementations of IDEA on reconfigurable computing Compiler Collection and GNU binutils, is recommended.
platforms and systems with co-processors have shown im-
proved performance. An implementation on an SRC-6E 4 Evaluation of the Target Algorithms
platform [27] achieved throughputs of approximately 590
Mbps for end-to-end software time for bulk data process- Software implementations of DES tend to be significantly
ing. Comparisons have been made between the performance slower than hardware implementations. Bit-level manipula-
of IDEA on Digital Signal Processors (DSPs), cryptographic tions such as those contained in the permutation, expansion,
co-processors, and hardware implementations on FPGAs in permuted choice, and Cyclic Left/Right Shift units do not
a hardware-software co-design system that makes use of en- map well to general-purpose processors. General-purpose
cryption in a mobile device. Reported performance figures processor instruction sets operate on multiple bits at a time
ranged from 32 Mbps on the DEC SA-110 and 53.1 Mbps based on the processor word size. Moreover, the DES S-
on the TI TMX320C6x DSP chips, to 180 Mbps using the Boxes do not use memory in an efficient manner. Soft-
VINCI cryptographic co-processor, to 528 Mbps with an ware look-up tables would appear to be the obvious imple-
FPGA-based implementation [26]. mentation choice for the DES S-Boxes. However, the DES
S-Boxes have 6-bit addresses and 4-bit output bits while
3 The LEON2 Processor most memories associated with general-purpose processors
use byte addressing with either 8-bit or 32-bit output data.
The LEON2 processor is a RISC CPU produced by As a result, many software implementations of DES exhibit
Gaisler Research that is implemented in VHDL and is fully throughputs that are at least a full order of magnitude slower
synthesizable. The model is highly configurable and the than hardware implementations.
source code is freely available under the GNU General Public Even the best software implementations are only capable
License which enables modifications and enhancements to of throughputs in the range of 100–200 Mbps. Most of these
the architecture. The LEON2 processor is based on the Scal- implementations recommend storing the 32-bit left and right
able Processor Architecture (SPARC R
) and features a fully halves of the data stream as a 48-bit padded word within a
synchronous design with a single clock, use of multiplexers 64-bit processor word and implementing the permutations
for loading of pipeline registers, separate combinational and and S-Boxes as precomputed look-up tables. Additionally,
sequential processes, and record types for interconnection of look-up table implementation for the S-Boxes is most effec-
component I/O signals. tive when the size of the look-up tables is minimized, guar-
All SPARC R
V8 instructions are implemented in the anteeing that the data will fit entirely in on-chip cache. Size
LEON2 processor architecture. Instructions are grouped ac- minimization of the S-Box look-up tables is achieved by im-
cording to the values of the various fields in the instruction plementing each S-Box in its own look-up table. Finally, one
opcode. Most of the available features of the LEON2 pro- key software optimization is the unrolling of software loops
cessor can be enabled, disabled, or adjusted. This research to increase performance. Even when software loops are too
employed a basic configuration with no FPU, PCI, Ethernet, cumbersome to unroll, using loop counters that decrement to
co-processor interface, or hardware multiplier or divider. To zero in place of loop counters that increment to a terminal
extend the LEON2 architecture beyond the scope of the stan- count are shown to greatly increase the performance of soft-
dard model, additional VHDL code is required. The specific ware implementations of the DES algorithm. However, the
files that must be modified depend on the functionality to be unrolling of software loops must be done with great care such
added, but if the instruction set is to be extended, the mod- that the total data storage space does not exceed the size of
ule containing the SPARC R
V8 opcode constants must be the on-chip cache to avoid extreme performance degradation
updated. [3], [17], [30].
The LEON2 implementation can be targeted to any type In terms of the core operations of IDEA, bit-wise XOR
of FPGA or ASIC technology. Pre-made packages enable and addition are easily implemented with one instruction
use of technology-specific cells to directly instantiate or au- each in software. For reduction modulo 216 , a processor such
tomatically infer the register files, caches, PCI FIFOs, and as the LEON2 that only performs arithmetic on 32-bit regis-
I/O pads. Functional verification and performance evalua- ter operands requires an additional logic instruction to mask
tion of programs built for the LEON2 architecture can be out the bits that may overflow into the sixteen most signifi-
performed with the provided generic test bench. The VHDL cant bits of the destination register. However, the major per-
source for the test bench is located in the /tbench/ sub-folder formance bottleneck for a software implementation of IDEA
467
is multiplication modulo 216 + 1. Multiplication may require S-Boxes, one S-Box, and four S-Boxes. For all other types
several clock cycles to complete (especially those without of extensions, setting the configuration variables to true in-
hardware multipliers), and the modular reduction, commonly cludes extensions into the architecture while a value of false
implemented using the Low-High Lemma [22], requires ad- excludes them from the architecture. All of the added func-
ditional execution time. tional units that support the proposed instruction set exten-
AES software performance bottlenecks typically occur in sions have been included in the IU. Component declarations
the SubBytes and MixColumns transformations, one or both were added for each of the added hardware units and instan-
of which are usually implemented via 8-bit to 8-bit look- tiated as part of the arithmetic logic unit. The decode stage
up tables. Often most of the AES round transformations of the IU pipeline sets flags for the instruction set extensions
— SubBytes, ShiftRows, and MixColumns — are combined and generates source register and immediate data. On the
into large look-up tables termed T-tables. Such implemen- next clock cycle, the execute stage passes the input operands
tations require up to three T-tables whose size may be ei- to the appropriate functional unit based on the instruction flag
ther 1 Kbytes or 4 Kbytes where the smaller tables require set. The result is then read from the functional unit when the
performing an additional rotation operation. The goal of the instruction specifies a destination register to receive an out-
T-tables is to avoid performing the MixColumns and InvMix- put.
Columns transformations as these operations perform Galois The SPARC R
assembly instruction opcodes are defined
Field fixed field constant multiplication, an operation which in the /leon/sparcv8.vhd module. The code added to this
maps poorly to general-purpose processors. However, the module include the values for the op3 field of the new in-
use of T-tables has a number of disadvantages. T-tables sig- structions. Provided in the LEON2 base package is a generic
nificantly increase code size, their performance is dependent test bench with disassembly support. During functional sim-
on the memory system architecture as well as cache size, and ulation, assembly instructions are printed out to the simula-
their use causes key expansion for AES decryption to be- tion software’s console window as they are executed. The
come significantly more complex. As an alternative to the module /leon/debug.vhd contains the functions that handle
use of T-tables, it is also feasible to have the processor per- display of the instruction strings, and disassembly support
form all of the AES round transformations. Row-based im- for the new instructions was added. When the op and op3
plementations have been demonstrated to allow for greater instructions match those of one of the instruction set exten-
efficiency in the implementation of the MixColumns and In- sions, the instruction name is printed followed by any appli-
vMixColumns transformations versus column-based imple- cable source/destination registers or immediate data for that
mentations. However the SubBytes transformation still re- instruction.
mains as a bottleneck, requiring separate 256-byte look-up
tables for encryption and decryption [1], [6], [33], [34]. 5.1 DES and Triple-DES
5 Proposed Instruction Set Extensions The desipl and desipr instructions produce the left and
right halves of the Initial Permutation, respectively. Simi-
All of the proposed instruction set extensions comply with larly, the desfpl and desfpr instructions produce the left and
the SPARC R
V8 instruction model using the Format 3 struc- right halves of the Final Permutation, respectively. The left
ture. All instructions that write to a register execute in one half of the input block must be located in the rs1 register and
clock cycle except for the mmul16 instruction which re- the right half must be located in the rs2 register. Inclusion of
quires two clock cycles. For those instructions that store data these instructions allows the Initial and Final Permutations to
directly into registers contained in the hardware added to the be completed in two instructions each. Traditional software
datapath, the data is available at the start of the next cycle, implementations require a series of bit mask setup, shift, log-
after instruction execution has completed. Building the set ical AND, and logical OR operations for each bit for a total
of development tools from the source files is necessary when of 256 instructions [38]. Even the improved permutation al-
extending the instruction set of the LEON2 processor. The gorithm used in [20] requires 44 instructions to complete on
source code archive for BCC v1.0.29c includes specific mod- a SPARC R
V8 processor such as the LEON2.
ifications made to two different versions of GNU binutils to The desdir instruction sets up the Key Generator to out-
support the LEON2 processor. This research employs the put round keys in either encryption or decryption order. This
v2.16.1 binutil. The file /opcodes/sparc-opc.c was edited to instruction also resets the round counter of the Key Genera-
include op3 values for the instruction set extensions. tor according to the chosen direction to ensure that output of
All added hardware modules are coded in VHDL and all the round keys may be immediately carried out in the proper
inputs and outputs are read from and written to the LEON2 order. It is not necessary to re-load the master key after this
pipelined integer unit (IU) register file. None of the added instruction is executed. The desdir instruction is used in con-
logic circuits rely on external memory for their functional- junction with the deskey and desf instructions. The deskey
ity. A new module was included with the VHDL source for instruction loads the 64-bit master key. The left half of the
the LEON2 processor architecture to provide an easy way to master key must be contained in the rs1 register, and the right
select specific extensions to be included in the architecture. half in the rs2 register. The desf instruction takes the right
For the AES S-Box extensions, the available options are no half of a round output block stored in the rs1 register and
468
stores the output of the f -function into the rd register. The registers. The 16-bit product is stored in the lower sixteen
round key is not specified here since the round key output bits of the rd register. The Modulo 216 + 1 Multiplication
of the Key Generator is hard-wired to the f -Function unit’s unit is designed based on the adder-based modular multiplier
round key input. After completion of this instruction, the Key in [2]. The Modulo 216 + 1 Multiplication unit first gener-
Generator is signaled to generate the key for the next round. ates partial products reduced modulo 216 + 1 as described
Due to the logic of the Key Generator, the desf instruction in Zimmerman’s investigation of efficient architectures for
may not be followed by another desf instruction. However, arithmetic modulo 2n ± 1 [41]. Each partial product is de-
this is not expected to cause a performance bottleneck due to termined by the formula:
the additional instruction required for swapping the values of
the left and right halves of the round input block. Implemen-
tation of the desdir, deskey, and desf instructions removes P Pi = xi · y15−i · · · y0 ȳ15 · · · ȳ16−i + x̄i · 0 · · · 01 · · · 1 (1)
the need for storage of the sixteen round keys and S-Boxes
in memory. All round keys are generated on-the-fly in the where the vector 0 · · · 01 · · · 1 contains 16 − i zeros and i
hardware added to the datapath. An implementation of the ones. In order to handle cases where x = 0 or y = 0, a
DES algorithm using these instructions requires two instruc- correction term k is defined:
tions for key scheduling and four instructions for each of the ⎧
⎪
⎪ 2 if x = 0 and y = 0,
sixteen rounds — one desf, one bit-wise XOR, and two reg- ⎨
x̄ + 3 if x = 0 and y = 0,
ister data transfers for swapping the left and right halves of k= (2)
⎪
⎪ ȳ + 3 if x = 0 and y = 0,
the round function output. ⎩
1 if x = 0 and y = 0.
The Permutation Unit implements the Initial Permutation
15
and Final Permutation. The inputs are loaded from the source An intermediate sum s = k + i=0 P Pi is then computed.
registers specified in the permutation instruction. The inputs The final step is a reduction of s modulo 216 + 1. Defining
are passed through two stages of 2-to-1 multiplexers. The sL to be the sixteen least significant bits and sH to be the
first stage selects the rearranged bits of either the Initial Per- remaining high-order bits of s, the result of the multiplication
mutation or the Final Permutation output. The output of the is s mod (216 + 1) = sL + 216 sH . This result may be
selected permutation is represented by the pair of 32-bit vec- reduced such that s mod (216 + 1) ≡ sL + s¯H + 2.
tors. The second stage of multiplexers sets the final output Using the Low-High Lemma for reduction modulo 216 + 1
to either the left half or the right half of the output. The Key assuming that t = sL + s¯H + 1 [22]:
Generator unit was designed to work in conjunction with the
f -Function unit. All registers are sensitive to the rising edge
(sL + s¯H + 2) mod 216 if t < 216 ,
of the clock. The 64 key bits, which are loaded from the s mod (216 + 1) = (sL + s¯H + 1) mod 216 if t ≥ 216 .
(3)
source registers specified in the deskey instruction, are rear-
ranged according to the PC-1 mapping. A 64-bit master key Additions are implemented in the Modulo 216 + 1 Mul-
is loaded by issuing the deskey instruction. When this in- tiplication unit with a carry-propagate adder tree. A generic
struction is in the execute stage, the key bits are loaded into model is specified in a VHDL source file separate from the
the C0 and D0 registers. When performing encryption, the multiplier source file. The generic component design allows
C and D registers are loaded with the values of C0 and D0 for adjustable width of the inputs.
rotated left by 1 bit since the first values of Ci and Di used
in the encryption key schedule are C1 and D1 . When per- 5.3 AES
forming decryption, the C and D registers are loaded with
the exact values of C0 and D0 because the final Ci and Di The aessb, aessbs, aessb4, and aessb4s instructions per-
values used in the encryption key schedule are the first values form the SubBytes and InvSubBytes operations on either
used in the decryption key schedule. The f -Function unit’s one or four of the bytes in the rs1 register. The instruction
32-bit input port for Ri−1 is loaded from the source register operands determine if the SubBytes operation or the InvSub-
operand specified in the desf instruction. The 48-bit input Bytes operation is to be performed. In the case of the single
port for the current round key is generated by the Key Gen- S-Box instructions, the operand also specifies which of the
erator unit. Expansion and permutation blocks are imple- four bytes in the rs1 register is to be operated upon. The
mented by rerouting the inputs and the S-Boxes are defined aessbs instruction provides the additional functionality of al-
as logic-based mappings. The output of the f -Function unit lowing the user to specify the destination byte in the rd reg-
is stored in the destination register specified in the rd field of ister while the aessb4s instruction allows the user to spec-
the desf instruction. ify the number of bytes to left-shift the 4-byte result prior to
storage in the destination register. The gfmkld instruction is
5.2 IDEA used to load one of the sixteen constants into the 4 × 4 con-
stant matrix of the Galois Field fixed field constant matrix
The mmul16 instruction computes rs1 · rs2 mod 216 + 1 multiplier. The constants are loaded row by row, beginning
and stores the product in the rd register. Both source with row zero and proceeding in order to row three. Each
operands must be in the lower sixteen bits of their respective row is loaded beginning with the constant from column zero
469
and proceeding in order to the constant in column three. Due decryption. The LEON2 processor implementation was syn-
to the logic that has been added to the multiplier for inclusion thesized targeting the Xilinx Virtex-4 XC4VLX25 FPGA us-
into the LEON2 processor datapath, instances of the gfmkld ing the Xilinx ISE 8.1i tools.
instruction may not be issued consecutively. The gfmmul in-
struction performs the Galois Field fixed field constant ma- 6.1 Code Size
trix multiplication on the input in the rs1 register and stores
the result in the rd register. The following tables present executable code sizes for im-
The SubBytes and InvSubBytes S-Boxes are implemented plementations of the target algorithms with different combi-
as logic-based mappings in hardware. The dir signal se- nations of instruction set extensions. For the DES and Triple-
lects the S-Box output. The Galois Field Fixed Field Con- DES algorithms, the permutation instructions decrease the
stant Multiplier unit performs the MixColumns or InvMix- total code size by up to a factor of 1.3 but have no effect on
Columns operation. The architecture of this multiplier is de- the key schedule as permutations are not used in the compu-
scribed in [9], [8]. The MixColumns and InvMixColumns tation of the round keys. Instructions supporting the round
operations are a matrix multiplication over the Galois Field key generation and round function have a much more pro-
GF(28 ) on each column of the state by a 4 × 4 fixed field nounced impact on code size. When these instructions are
constant matrix. This means that a total of sixteen multi- used, all of the lengthy permutation routines and memory-
plications in the Galois Field GF(28 ) must be performed to based S-Boxes are no longer needed. Encryption and decryp-
complete the entire operation. Each product must then be re- tion code size is reduced by up to a factor of 4.0 in the case
duced modulo m(x) = x8 + x4 + x3 + x + 1, the irreducible of DES and a factor of 3.5 in the case of Triple-DES with the
polynomial specified for AES. To accomplish the multipli- use of these instructions alone. Key scheduling is handled
cation and modular reduction simultaneously, the operation via two instructions, making the respective code size a small
can be represented as an 8 × 8 matrix multiplication over percentage of the remaining program code needed to imple-
the Galois Field GF(2). The constants in the inner matrix ment encryption and decryption. When all of the instruction
are determined by the constant factor in the multiplication set extensions are implemented, DES code size is reduced by
and the polynomial m(x). Consider the representative Ga- factors of up to 31.2 in non-feedback mode and 21.1 in feed-
lois Field GF(28 ), used by AES in the MixColumns and In- back mode and Triple-DES code size is reduced by factors of
vMixColumns transformations. Note that [A3 : A0 ] are the up to 15.9 in non-feedback mode and 13.0 in feedback mode.
input bytes and [B3 : B0 ] are the output bytes [7]: Note that the decryption key schedule code size data for
IDEA includes that of the encryption key schedule. This is
⎡ ⎤ ⎡ ⎤⎡ ⎤
B0 K00 K01 K02 K03 A0 because the decryption keys are determined from the encryp-
⎢ B1 ⎥ ⎢ K10 K11 K12 K13 ⎥ ⎢ A1 ⎥ tion keys. The key schedule for decryption sees a slight de-
⎢ ⎥ ⎢ ⎥⎢ ⎥
⎣ B2 ⎦ = ⎣ K20 K21 K22 K23 ⎦ ⎣ A2 ⎦ (4) crease in code size with the use of the mmul16 instruction
B3 K30 K31 K32 K33 A3 because of the need to compute multiplicative inverses mod-
ulo 216 . The addition of the mmul16 instruction significantly
The core operation in this fixed field multiplication is an decreases the code size of encryption and decryption by fac-
8-bit inner product that must be performed sixteen times, tors of up to 2.8 in non-feedback mode and 2.4 in feedback
four per row. The four inner products of each row are then mode. Note that due to the absence of a hardware multiplier
combined via a bit-wise XOR operation to form the final in the LEON2 integer unit, the multiplication is performed
output word. For a known primitive polynomial p(x), k(x) by a library function.
(representing the 8-bit constant), and a generic input a(x), The AES instruction set extensions aessbs and aessb4s
the resultant polynomial equation takes the form b(x) = yielded either equivalent or reduced code size for both
a(x) × k(x) mod p(x) where each coefficient of b(x) is column-oriented and row-oriented implementations versus
a function of a(x). This results in an 8-bit × 8-bit matrix the aessb and aessb4 instruction set extensions. Using one
representing the coefficients of b(x) in terms of a(x) [7]. An S-Box led to the largest reduction in code size for column-
8-bit × 8-bit matrix must be generated for each Kxy , result- oriented implementations by up to a factor of 1.8 for encryp-
ing in a total of sixteen matrices. Note that this analysis holds tion and 1.6 for decryption. In the case of the row-oriented
true for Galois Fields other than GF(28 ) with corresponding implementations, the use of four S-Boxes led to the largest
adjustments to the mapping matrix. reduction in code size by up to a factor of 2.6 for encryp-
tion and 2.2 for decryption. However, the use of one S-Box
6 Analysis of Results results in significantly reduced key scheduling code size for
row-oriented implementations.
Functional verification was performed using twelve test When only the gfmmul instruction is incorporated, code
vectors for each algorithm. Performance testing measured size decreases by up to a factor of 1.3 for encryption and
the execution cycles required to perform one iteration of 1.8 for decryption in the column-oriented implementations.
the target algorithm. Each algorithm was tested in both The original code used to implement the InvMixColumns op-
non-feedback (Electronic Code Book) and feedback (Cipher eration for decryption requires many more operations than
Block Chaining) modes of operation for both encryption and the MixColumns operation used by encryption. Only four
470
instances of the gfmmul instruction are needed to perform encryption is unaffected, the use of the mmul16 instruction
both operations – one for each column of the AES state. results in a decryption key scheduling speedup by a factor of
The use of the gfmmul instruction results in up to a factor 7.8. Similarly, the use of the mmul16 instruction results in
of 1.1 increase in code size for encryption and 1.1 decrease speedups of IDEA by up to a factor of 7.7 in non-feedback
in code size for decryption in the row-oriented implemen- mode and 7.1 in feedback mode.
tations. This occurs because the MixColumns and InvMix- The AES instruction set extensions aessbs and aessb4s
Columns operations operate on the columns of the AES state, yielded either equivalent or reduced execution cycles for both
requiring additional instructions to rearrange the bytes prior column-oriented and row-oriented implementations versus
to being processed by the gfmmul instruction in the row- the aessb and aessb4 instruction set extensions. Using one
oriented implementations. The effect of the byte rearrange- S-Box led to the largest speedup for column-oriented imple-
ment is that column-oriented implementations require signif- mentations by up to a factor of 1.4 for encryption and 1.2 for
icantly smaller code space than the row-oriented implemen- decryption. Using four S-Boxes led to the largest speedup
tations when all of the instruction set extensions are imple- for row-oriented implementations by up to a factor of 1.9 for
mented. For column-oriented implementations, code size is encryption and 1.6 for decryption. However, the use of one
reduced by up to a factor of 3.6 for encryption and 4.8 for S-Box reduced the key scheduling execution cycles for row-
decryption using one S-Box while for row-oriented imple- oriented implementations.
mentations, code size is reduced by up to a factor of 2.1 for Incorporating only the gfmmul instruction yields
encryption and 2.3 for decryption using four S-Boxes. speedups of up to a factor of 1.8 for encryption and 3.0
for decryption in the column-oriented implementations.
Key and
Base Permute f-Function All The original code used to implement the InvMixColumns
Operation Code ISEs ISEs ISEs
DES Key Sched 1524 1524 8 8 operation for decryption requires many more operations,
DES Enc ECB 1996 1572 500 68 and thus cycles, versus the MixColumns operation used by
DES Enc CBC 2028 1604 532 100
DES Dec ECB 1996 1572 496 64 encryption. The use of the gfmmul instruction results in
DES Dec CBC 2028 1604 528 96
Triple-DES Key Sched 1524 1524 24 24 up to a factor of 1.1 decrease in performance for encryp-
Triple-DES Enc ECB 2096 1656 588 132
Triple-DES Enc CBC 2128 1688 620 164 tion and 1.04 speedup for decryption in the row-oriented
Triple-DES Dec ECB 2096 1656 588 132 implementations. This occurs because the MixColumns and
Triple-DES Dec CBC 2128 1688 620 164
InvMixColumns operations operate on the columns of the
Table 1. DES and Triple-DES code size (bytes) AES state, requiring additional cycles to rearrange the bytes
prior to being processed by the gfmmul instruction in the
Modulo 216 + 1 row-oriented implementations. The effects of the byte rear-
Base Multiplication rangement is such that the speedups of the column-oriented
Operation Code ISE
Enc Key Sched 436 436 implementations are significantly larger than the speedups of
Dec Key Sched 844 760
Enc ECB 596 212 the row-oriented implementations when all of the instruction
Enc CBC 660 276 set extensions are implemented and row-oriented imple-
Dec ECB 596 212
Dec CBC 660 276 mentations perform better when the gfmmul instruction is
Table 2. IDEA code size (bytes) combined with one S-Box instead of four S-Boxes. Column-
oriented implementations yield speedups by a factor of 4.0
for encryption and 6.6 for decryption using one S-Box while
6.2 Execution Cycles row-oriented implementations yield speedups by a factor of
1.4 for encryption and 1.7 for decryption.
The following tables present the number of clock cycles
Key and
required to complete a full iteration of each algorithm. For Base Permute f-Function All
Operation Code ISEs ISEs ISEs
the DES and Triple-DES algorithms, the permutation in- DES Key Sched 2917 2917 9 9
structions alone have virtually no impact on the execution DES Enc ECB
DES Enc CBC
4330
4342
4215
4227
252
263
133
144
cycles for both DES and Triple-DES. However, the instruc- DES Dec ECB
DES Dec CBC
4330
4343
4215
4228
250
263
132
144
tions supporting round key generation and the round func- Triple-DES Key Sched 8750 8750 24 24
tion have a significant impact on execution cycles, yielding Triple-DES Enc ECB 12993 12648 752 396
Triple-DES Enc CBC 13004 12659 764 407
speedups for DES by a factor of up to 17.3 in non-feedback Triple-DES Dec ECB 12993 12648 751 395
Triple-DES Dec CBC 13004 12659 764 407
mode and 16.5 in feedback mode and speedups for Triple-
DES by a factor of up to 17.3 in non-feedback mode and Table 4. DES and Triple-DES execution cycles
17.0 in feedback mode. Implementation of all of the instruc-
tion set extensions yields speedups for DES by a factor of Table 7 details execution cycle counts for some combi-
up to 32.6 in non-feedback mode and 30.2 in feedback mode nations of extensions are compared with results published
and speedups for Triple-DES by a factor of up to 32.8 in non- by the ISEC project [33]. Their work includes the use of
feedback mode and 32.0 in feedback mode. logic-based mappings as a choice of hardware implementa-
The use of the mmul16 instruction significantly decreases tion for the S-Boxes. The sbox and sbox4 instructions from
the IDEA execution cycle count. While key scheduling for the ISEC project are equivalent to the aessbs and aessb4 in-
471
1 S-Box and 4 S-Boxes and
Base 1 S-Box 4 S-Boxes GF Mult GF Mult GF Mult
Operation Code ISEs ISEs ISEs ISEs ISEs
Col Oriented Key Sched 588 176 164 588 176 164
Col Oriented Enc ECB 1404 764 908 1052 392 520
Col Oriented Enc CBC 1468 828 972 1116 456 584
Col Oriented Dec ECB 1872 1200 1408 1048 392 528
Col Oriented Dec CBC 1936 1264 1472 1112 456 592
Row Oriented Key Sched 564 256 564 564 256 564
Row Oriented Enc ECB 1220 620 476 1328 540 584
Row Oriented Enc CBC 1284 684 540 1392 604 648
Row Oriented Dec ECB 1328 692 592 1296 668 568
Row Oriented Dec CBC 1392 756 656 1360 732 632
Table 3. AES code size (bytes)
1 S-Box and 4 S-Boxes and

Base 1 S-Box 4 S-Boxes GF Mult GF Mult GF Mult
Operation Code ISEs ISEs ISEs ISEs ISEs
Col Oriented Key Sched 519 376 346 519 376 346
Col Oriented Enc ECB 1742 1250 1423 962 435 549
Col Oriented Enc CBC 1768 1276 1449 989 462 576
Col Oriented Dec ECB 2872 2360 2638 963 435 561
Col Oriented Dec CBC 2899 2387 2665 990 462 588
Row Oriented Key Sched 669 606 669 669 606 669
Row Oriented Enc ECB 1335 926 717 1573 751 971
Row Oriented Enc CBC 1361 952 743 1600 778 998
Row Oriented Dec ECB 1622 1193 1026 1557 1132 964
Row Oriented Dec CBC 1649 1220 1053 1584 1159 991
Table 6. AES execution cycles
Modulo 216 + 1 and combinational logic used to compute the matrix product,
Base Multiplication
Operation Code ISE the Galois Field matrix multiplier is the largest of the added
Enc Key Sched 894 894
Dec Key Sched 43005 5510 hardware units. The implementation with all of the proposed
Enc ECB 2580 337 instruction set extensions leads to a total area increase over
Enc CBC 2602 365
Dec ECB 2577 337 the baseline configuration by approximately 63%. The mod-
Dec CBC 2602 365
ulo 216 + 1 multiplier for IDEA was the single largest
Table 5. IDEA execution cycles contributor to decreases in the implementation’s maximum
operating frequency. A purely combinational design of the
multiplier had large path delays due to the carry-propagate
structions presented in this research. The mixcol4 instruc- adder structure, yielding a maximum operating frequency of
tion uses an entire 32-bit word as input into an AES-specific only 72 MHz. The multiplier was modified to create a two
MixColumns/InvMixColumns functional unit. In order to cycle instruction implementation to decrease the effect on
address the ShiftRows operation, the instructions sbox4s the processor clock frequency. With all extensions imple-
and mixcol4s combine an implicit ShiftRows operation with mented, the LEON2 processor implementation resulted in a
SubBytes and MixColumns respectively. This functionality clock frequency of approximately 117 MHz, a 10% decrease
is similar but not identical to the combination of the aessb4s compared to the baseline implementation.
and gfmmul instructions. CLB Freq
Slices F/Fs (MHz)
Enc Enc Dec Dec Baseline LEON2 Implementation 4563 1803 129.929
Implementation Cyc SU Cyc SU DES
No Extensions (ISEC) 1637 1.0 1955 1.0 Permutation 5200 2372 128.083
No Extensions (UML) 1335 1.0 1622 1.0 Keygen/f-Function 5498 2477 122.786
4 S-Boxes sbox4 (ISEC) 1020 1.6 1435 1.4 All Units 5543 2488 123.880
4 S-Boxes aessb4 (UML) 837 1.6 1117 1.5 IDEA mod (216 + 1) Multiplier 5411 2398 118.162
GF Mult mixcol4 (ISEC) 939 1.7 970 2.0 AES
GF Mult gfmmul (UML) 962 1.4 963 1.7 1 S-Box 5367 2368 128.996
sbox and mixcol4 (ISEC) 458 3.6 458 4.3 1 S-Box (S) 5281 2336 128.594
aessbs and gfmmul (UML) 435 3.1 435 3.7 4 S-Boxes 5657 2380 127.369
4 S-Boxes (S) 6633 2910 125.639
sbox4s and mixcol4s (ISEC) 458 3.6 459 4.3 GF Multiplier 5681 3415 120.766
aessb4s and gfmmul (UML) 549 2.4 561 2.9 1 S-Box and GF Multiplier 5862 3399 126.787
1 S-Box (S) and GF Multiplier 5920 3416 123.896
4 S-Boxes and GF Multiplier 6174 3414 121.422
Table 7. AES execution cycles comparison 4 S-Boxes (S) and GF Multiplier 6420 3543 122.134
LEON2 with All Extensions 7447 3766 116.828
6.3 Hardware Resource Requirements Table 8. Hardware Utilization — Xilinx

XC4VLX25 FPGA
Table 8 shows the component usage of the Xilinx Virtex-
4 XC4VLX25 FPGA by each added functional unit and the 6.4 Algorithm Throughput Comparisons
total utilizations of the LEON2 processor with combinations
of instruction set extensions for each targeted algorithm. Due Table 9 presents throughput data for baseline implementa-
to the number of storage bits needed for matrix configuration tions of the target algorithms versus implementations of the
472
target algorithms using the proposed instruction set exten- 8 Acknowledgement
sions when performing encryption in non-feedback mode.
The results are analyzed with respect to the hardware re- We would like to thank Stefan Tillich and Johann
sources required for each implementation to determine the Großchädl from the Graz University of Technology for their
hardware cost associated with improving the execution time useful discussions regarding their instruction set extensions
of the targeted algorithms using the proposed instruction targeting the LEON2 processor.
set extensions. Throughput data is presented in terms of
Megabits per second (Mbps), hardware usage is presented References
in terms of FPGA configurable logic block (CLB) slices, and
throughput/area (T/A) ratios are presented in bits per second [1] G. Bertoni, L. Breveglieri, P. Fragneto, M. Macchetti, and
per slice. S. Marchesin. Efficient Software Implementation of AES on
32-Bit Platforms. In B. S. K. Jr., Ç. K. Koç, and C. Paar,
The data shows that the instruction set extensions yielded editors, Workshop on Cryptographic Hardware and Embed-
the best improvement for DES in terms of T/A ratio, with ded Systems — CHES 2002, volume LNCS 2523, pages 159–
the ratio increasing by a factor of 25.6. This increase is a re- 171, Redwood Shores, California, USA, August 13–15 2002.
Springer-Verlag.
sult of nearly all of the DES functionality being off-loaded to [2] J. Beuchat. Modular Multiplication for FPGA Implemen-
the added hardware. As expected, T/A ratios for Triple-DES tation of the IDEA Block Cipher. In Proceedings of the
were approximately 13 of the corresponding values for DES Fourteenth IEEE International Conference on Application-
Specific Systems, Architectures and Processors — ASAP
because both algorithms require the same hardware in order 2003, pages 412–422, The Hague, The Netherlands, June 24–
to be implemented using the instruction set extensions. The 26 2003.
[3] E. Biham. A Fast New DES Implementation in Software.
T/A ratio for Triple-DES increased by a factor of 25.8 when In E. Biham, editor, Fourth International Workshop on Fast
instruction set extensions are used, matching the increase ev- Software Encryption, volume LNCS 1267, pages 260–272,
idenced by DES. In the case of IDEA, the instruction set Haifa, Israel, January 20–22 1997. Springer-Verlag.
[4] J. Burke, J. McDonald, and T. M. Austin. Architectural Sup-
extensions had a significant impact upon performance at a port for Fast Symmetric-Key Cryptography. In Proceedings
reasonable hardware cost, with the measured T/A ratio in- of the Ninth International Conference on Architectural Sup-
creasing by a factor of 5.9. For AES, the instruction set ex- port for Programming Languages and Operating Systems —
AS-PLOS 2000, pages 178–189, Cambridge, Massachusetts,
tensions used in conjunction with the column-oriented im- USA, November 12–15 2000.
plementations yield larger increases in throughput versus the [5] F. Crowe, A. Daly, T. Kerins, and W. Marnane. Single-Chip
instruction set extensions used in conjunction with the row- FPGA Implementation of a Cryptographic Co-processor. In
O. Diessel and J. Williams, editors, Proceedings of the
oriented implementations. As a result, the column-oriented 2004 IEEE International Conference on Field-Programmable
implementations yielded a greater increase in T/A ratio and Technology — FPT 2004, pages 279–285, Brisbane, Aus-
are superior to the row-oriented implementations when using tralia, December 6–8 2004.
[6] J. Daemen and V. Rijmen. The Design of Rijndael. Springer,
the proposed instruction set extensions. New York, New York, USA, 2002.
[7] A. J. Elbirt. Reconfigurable Computing for Symmetric-Key
Algorithms. PhD thesis, Worcester Polytechnic Institute,
7 Conclusions Worcester, Massachusetts, USA, April 2002. Available at
http://faculty.uml.edu/aelbirt/thesis.pdf.
[8] A. J. Elbirt. Efficient Implementation of Galois Field Fixed
Instruction set extensions for improving software im- Field Constant Multiplication. In Proceedings of the Interna-
plementations of symmetric-key algorithms have been pro- tional Conference on Information Technology: New Genera-
posed. Existing literature on the subject of enhancing the per- tion — ITNG ’06, pages 172–177, Las Vegas, Nevada, USA,
April 10–12 2006.
formance of symmetric-key algorithms was discussed, fol- [9] A. J. Elbirt. Fast and Efficient Implementation of AES Via
lowed by detailed descriptions of the targeted processor and Instruction Set Extensions. In Proceedings of the Third
the targeted cryptographic algorithms. Descriptions of the IEEE International Symposium on Security in Networks and
Distributed Systems, pages 396–403, Niagara Falls, Canada,
custom instructions were given along with the functional May 21–23 2007.
units that implement the underlying logical and arithmetic [10] P. Ganesan, R. Venugopalan, P. Peddabachagari, A. Dean,
operations. The results show that the proposed instructions F. Mueller, and M. Sichitiu. Analyzing and Modeling Encryp-
tion Overhead for Sensor Network Nodes. In Proceedings of
have a significant positive effect on program code size for the Second ACM International Conference on Wireless Sen-
all of the targeted algorithms, shrinking the number of code sor Networks and Applications — WSNA ’03, pages 151–159,
bytes by up to a factor of 31.2. Execution time for all algo- San Diego, California, USA, September 19 2003.
[11] A. V. Garcia and J.-P. Seifert. On the Implementation of
rithms was also improved, with demonstrated speedups by the Advanced Encryption Standard on a Public-key Crypto-
factors of up to 32.8. The instruction set extensions required Coprocessor. In Proceedings of the Fifth Smart Card Re-
only a 63% increase in the logic utilization of the LEON2 search and Advanced Application Conference — CARDIS
’02, pages 135–146, San Jose, California, USA, Novem-
processor on the chosen FPGA device while decreasing the ber 21–22 2002.
maximum clock frequency by approximately 10%. All of [12] P. Gil. How Big is the Internet? World Wide Web
— http://netforbeginners.about.com/cs/technoglossary/f/
the targeted algorithms evidenced an increase in T/A ratio, FAQ3.htm, 2005.
increasing by a factor of up to 25.8. Finally, it was demon- [13] M. Gschwind. Instruction Set Selection for ASIP Design.
strated that column-oriented implementations of AES are In A. A. Jerraya, L. Lavagno, and F. Vahid, editors, Pro-
ceedings of the Seventh International Symposium on Hard-
superior to row-oriented implementations for all evaluation ware/Software Codesign — CODES’99, pages 7–11, Rome,
metrics when using the proposed instruction set extensions. Italy, March 1999.
473
Base Code With ISEs With ISEs
Algorithm Throughput CLB Slices T/A Ratio Throughput CLB Slices T/A Ratio
DES 1.92 4563 420.78 59.61 5543 10754.10
Triple-DES 0.64 4563 140.26 20.07 5543 3620.78
IDEA 3.23 4563 707.87 22.44 5411 4147.11
AES-Columns 9.55 4563 2092.92 28.89 5920 4880.07
AES-Rows 12.47 4563 2732.85 16.45 5920 2778.72
Table 9. Throughput to area ratios
[14] J.-O. Haenni. Architecture EPIC et Jeux d’Instructions Mul- [29] D. A. Osvik. Efficient Implementation of the Data Encryption
timédias Pour Applications Cryptographiques. PhD thesis, Standard. PhD thesis, Universitatis Bergensis, April 2003.
Swiss Federal Institute of Technology, Lausanne, Switzer- [30] A. Pfitzmann and R. Assman. More Efficient Software Im-
land, 2002. plementations of (Generalized) DES. Computers & Security,
[15] A. Hodjat, D. D. Hwang, B. Lai, K. Tiri, and I. Ver- 12(5):477–500, 1993.
bauwhede. A 3.84 GBits/s AES Crypto Coprocessor with [31] S. Ravi, A. Raghunathan, N. Potlapally, and M. Sankaradass.
Modes of Operation in a 0.18-µm CMOS Technology. In System Design Methodologies for a Wireless Security Pro-
Proceedings of the Fifteenth ACM Great Lakes symposium on cessing Platform. In Proceedings of the 2002 Design Automa-
VLSI — GLSVLSI ’05, pages 60–63, Chicago, Illinois, USA, tion Conference — DAC 2002, pages 777–782, New Orleans,
April 17–19 2005. Louisiana, USA, June 10–14 2002.
[16] A. Hodjat and I. Verbauwhede. Interfacing a High Speed [32] S. Tillich and J. Großchädl. Accelerating AES Using In-
Crypto Accelerator to an Embedded CPU. In Proceedings of struction Set Extensions for Elliptic Curve Cryptography. In
the 38th Asilomar Conference on Signals, Systems, and Com- O. Gervasi, M. L. Gavrilova, V. Kumar, A. Laganà, H. P. Lee,
puters, volume 1, pages 488–492, Los Angeles, California, Y. Mun, D. Taniar, and C. J. K. Tan, editors, International
USA, November 7–10 2004. Conference on Computational Science and Its Applications
[17] J. Hughes. Implementation of NBS/DES Encryption Algo- — ICCSA 2005, volume LNCS 3481, pages 665–675, Singa-
rithm in Software. In Colloquium on Techniques and Impli- pore, May 9–12 2005. Springer-Verlag.
cations of Digital Privacy and Authentication Systems, 1981. [33] S. Tillich and J. Großchädl. Instruction Set Extensions
[18] J. Irwin and D. Page. Using Media Processors for Low- for Efficient AES Implementation on 32-bit Processors. In
Memory AES Implementation. In Proceedings of the L. Goubin and M. Matsui, editors, Workshop on Crypto-
Fourteenth IEEE International Conference on Application- graphic Hardware and Embedded Systems — CHES 2006,
Specific Systems, Architectures and Processors — ASAP volume LNCS 4249, pages 270–284, Yokohama, Japan, Oc-
2003, pages 144–154, The Hague, The Netherlands, June 24– tober 10–13 2006. Springer-Verlag.
26 2003. [34] S. Tillich and J. Großchädl and A. Szekely. An Instruction
[19] K. Küçükçakar. An ASIP Design Methodology for Embed- Set Extension for Fast and Memory-Efficient AES Implemen-
ded Systems. In A. A. Jerraya, L. Lavagno, and F. Vahid, tation. In J. Dittmann, S. Katzenbeisser, and A. Uhl, editors,
editors, Proceedings of the Seventh International Symposium Proceedings of the Ninth International Conference on Com-
on Hardware/Software Codesign — CODES’99, pages 17– munications and Multimedia Security — CMS 2005, volume
21, Rome, Italy, March 1999. LNCS 3677, pages 11–21, Salzburg, Austria, September 19–
[20] P. Karn. DES Software Implementation. 21 2005. Springer-Verlag.
http://http://www.citi.umich.edu/projects/apv/. [35] G. A. Sathishkumar and C. Prasanna. A Novel VLSI Archi-
[21] H. W. Kim and S. Lee. Design and Implementation of a Pri- tecture for an Integrated Crypto Processor. In Proceedings
vate and Public Key Crypto Processor and its Application to a of the 2005 Annual IEEE INDICON Conference, pages 272–
Security System. IEEE Transactions on Consumer Electron- 275, Chennai, India, December 11–13 2005.
ics, 50(1):214–224, February 2004. [36] P. Schaumont, K. Sakiyama, A. Hodjat, and I. Verbauwhede.
[22] X. Lai and J. Massey. A Proposal for a New Block Encryp- Embedded Software Integration for Coarse-Grain Reconfig-
tion Standard. In I. B. Damgård, editor, Advances in Cryptol- urable Systems. In Proceedings of the Eighteenth Inter-
ogy — EUROCRYPT ’90, volume LNCS 473, pages 389–404, national Parallel and Distributed Processing Symposium —
Berlin, Germany, May 1990. Springer-Verlag. IPDPS 2004, pages 137–142, Santa Fe, New Mexico, USA,
[23] R. B. Lee. Accelerating Multimedia with Enhanced Micro- April 26–30 2004.
processors. IEEE Micro, 15(2):22–32, April 1995. [37] B. Schneier. Applied Cryptography. John Wiley & Sons Inc.,
[24] R. B. Lee, Z. Shi, and X. Yang. Efficient Permutation Instruc- New York, New York, USA, 2nd edition, 1996.
tions for Fast Software Crytography. IEEE Micro, 21(6):56– [38] Z. Shi and R. B. Lee. Bit Permutation Instructions for
69, November/December 2001. Accelerating Software Cryptography. In Proceedings of
[25] H. Lipmaa. IDEA: A Cipher for Multimedia Architectures? the Eleventh IEEE International Conference on Application-
In S. Tavares and H. Meijer, editors, Fifth Annual Workshop Specific Systems, Architectures and Processors — ASAP
on Selected Areas in Cryptography, volume LNCS 1556, 2000, pages 138–148, 2000.
Kingston, Ontario, Canada, August 1998. Springer-Verlag. [39] A. Wang, E. Killian, D. E. Maydan, and C. Rowen. Hard-
[26] O. Mencer, M. Morf, and M. J. Flynn. Hardware Soft- ware/Software Instruction Set Configurability for System-
ware Tri-Design of Encryption for Mobile Communication On-Chip Processors. In Proceedings of the 38th Design Au-
Units. In Proceedings of International Conference on Acous- tomation Conference — DAC 2001, pages 184–188, Las Ve-
tics, Speech, and Signal Processing, volume 5, pages 3045– gas, Nevada, USA, June 18–22 2001.
3048, Seattle, Washington, USA, May 1998. [40] L. Wu, C. Weaver, and T. Austin. CryptoManiac: A Fast Flex-
[27] A. Michalski, K. Gaj, and D. A. Buell. High-Throughput ible Architecture for Secure Communication. In B. Werner,
Reconfigurable Computing: A Design Study of an IDEA En- editor, Proceedings of the 28th Annual International Sympo-
cryption Cryptosystem on the SRC-6E Reconfigurable Com- sium on Computer Architecture — ISCA-2001, pages 110–
puter. In T. Rissa, S. J. E. Wilton, and P. H. W. Leong, editors, 119, Goteborg, Sweden, June 30–July 4 2001.
[41] R. Zimmermann. Efficient VLSI Implementation of Modulo
Proceedings of the International Conference on Field Pro-
grammable Logic and Applications — FPL ’05, pages 681– (2n ±1) Addition and Multiplication. Technical report, Swiss
686, Tampere, Finland, August 24–26 2005. Federal Institute of Technology (ETH), Zurich, Switzerland,
[28] D. Oliva, R. Buchty, and N. Heintze. AES and the Cryptonite 1999. http://www.stud.ee.ethz.ch/ zimmi/publications/ mod-
Crypto Processor. In J. H. Moreno, P. K. Murthy, T. M. Conte, ulo arith.ps.gz.
and P. Faraboschi, editors, Proceedings of the 2003 Interna-
tional Conference on Compilers, Architecture and Synthesis
for Embedded Systems — CASES 2003, pages 198–209, San
Jose, California, USA, October 30-November 1 2003.
474

1

Diunggah oleh

Informasi Dokumen

Deskripsi Asli:

Hak Cipta

Format Tersedia

Bagikan dokumen Ini

Bagikan atau Tanam Dokumen

Opsi Berbagi

Apakah menurut Anda dokumen ini bermanfaat?

Apakah konten ini tidak pantas?

Hak Cipta:

Format Tersedia

1

Diunggah oleh

Hak Cipta:

Format Tersedia

2008 Annual Computer Security Applications Conference

Instruction Set Extensions for Enhancing the Performance of Symmetric-Key

Sean O’Melia ∗, Member, IEEE, and AJ Elbirt †, Member, IEEE

Abstract session key and to provide authenticity through digital sig-

1063-9527/08 $25.00 © 2008 IEEE 465

Table 3. AES code size (bytes)

1 S-Box and 4 S-Boxes and

Table 6. AES execution cycles

6.3 Hardware Resource Requirements Table 8. Hardware Utilization — Xilinx

Table 9. Throughput to area ratios

Anda mungkin juga menyukai