
Comparison of Multipliers Architectures through Emulation and Handle-C FPGA Implementation
Mahmoud A. Al-Qutayri, Hassan R. Barada and Ahmed Al-Kindi
Etisalat University College, Sharjah, UAE
mqutayri@euc.ac.ae

Abstract

This paper presents a study that compares the architectures and studies the performance of some of the major integer multiplication algorithms. This is achieved through a simulation environment that, at this stage, implements four major multipliers in the C++ programming language and implements the same architectures in an FPGA prototype system. The environment is flexible and has a well-designed user interface that makes it suitable for educational use. It enables the user to emulate the multipliers for various data inputs and observe the type of operation being executed. Handle-C was used for the synthesis and subsequent implementation of the multipliers in an FPGA. The multipliers involved in this study are Hennessy, carry save, ripple carry array and Wallace tree. The performance of each multiplier is assessed through a study of its area and real-time complexities.

1. Introduction

The multiplication process is an important operation that is heavily used in various computation fields such as arithmetic operations, digital communications, digital signal processing and multimedia systems. Multiplication algorithms fall into two broad categories: signed and unsigned. In the signed category the binary numbers to be multiplied are assumed to be signed. An example of this type is the Booth multiplier. The second category uses unsigned numbers and encompasses two subcategories: array multipliers [1-3] and tree multipliers [4-6]. The array multipliers include the scaling accumulator multiplier and the carry save array multiplier. The tree multipliers include the computed partial product multiplier and the Wallace tree multiplier.

The multiplication algorithms are compared in terms of their time complexity as well as their area complexity. Time complexity refers to the total number of clock cycles needed during the execution process of the multiplier. This in turn depends on the number of add and shift operations the particular multiplier algorithm requires. This is also normally a function of the number of bits needed to represent the operands (i.e. n). The time complexity of array multipliers is proportional to the operand word length ($O(n)$), but the time complexity of tree multipliers is proportional to the logarithm of the operand word length ($O(\log n)$) [7]. The area complexity of both types of multipliers, when physically implemented, is proportional to the square of the operand word length ($O(n^2)$) [8].

In this paper the performance of the four multipliers, namely Hennessy, ripple-carry array, carry-save array and Wallace tree, is studied through software emulation as well as physical FPGA (field programmable gate array) implementation [9,10]. The software emulation serves primarily as an educational tool that shows the number and type of steps involved in the multiplication process for each multiplier algorithm. The synthesis process for FPGA implementation is done using the Handle-C programming language [11]. An estimate of the required area for each multiplier is obtained in terms of the equivalent number of NAND gates, and the real-time performance is measured using a logic analyzer.

The paper is organized as follows: Section 2 describes the algorithms and architectures of the multipliers used in this study. Section 3 outlines the multipliers simulation environment. Section 4 describes the FPGA implementation of the multipliers using Handle-C. This section includes a detailed study of both the area and time complexities of the multipliers. Section 5 presents the conclusions of the study and outlines the future work.

2. Multiplication Algorithms & Architecture

This section discusses in detail the algorithms and architectures of the multipliers that are the subject of this study. It also emulates the operation of each multiplier through a numerical example. The same example is used throughout in order to enable comparison between the various multipliers.

2.1 Hennessy Multiplier

Figure 1 shows the Hennessy multiplier, which has a regular structure. If the multiplier has a length of n bits, it will need n steps to accomplish the multiplication process, where each step consists of either an add-and-shift operation or just a shift operation depending on the

1-4244-0212-3/06/$20.00/©2006 IEEE 240


content of the multiplier [15]. An n-bit Hennessy multiplier requires n-bit adders. A 4-bit Hennessy multiplier needs a 4-bit register called A, which is initialized to 0. The multiplicand is loaded into a 4-bit register called M and the multiplier is loaded into a 4-bit register called Q. The single-bit register called C is also initialized to 0. The number of steps equals the multiplier length, which is 4 in this example. In each step, the least significant bit of Q is checked. If it equals 0, the combination CAQ is shifted to the right. If it equals 1, M is added to A, the result is put in A, and then the combination CAQ is shifted to the right. This process is repeated n times.

Figure 1: Architecture of Hennessy Multiplier

An example that emulates the Hennessy multiplier is given in Table 1 below. In this example the multiplicand = $0010_2$ (decimal 2) and the multiplier = $1001_2$ (decimal 9). Note that the trace in Table 1 starts with 0010 in Q and adds 1001 as M; since multiplication is commutative, the product is unchanged.

Table 1: Emulation of 4-bit Hennessy Multiplier

Registers Content
C   A      Q      Operation
0   0000   0010   Initialization
0   0000   0001   Step 1: Shift to right
0   1001   0001   Step 2: A=A+M
0   0100   1000           Shift to right
0   0010   0100   Step 3: Shift to right
0   0001   0010   Step 4: Shift to right

The result is 00010010 (decimal 18), which is the content of the combination AQ.

2.2 Ripple Carry Array Multiplier

The Ripple Carry Array multiplier, with its regular architecture, is shown in Figure 2. The number of steps required by an n-bit multiplier is (3*n-4). This expression was inferred during the design process by taking the block diagram of the 4-bit multiplier, extending it to multipliers of larger lengths, and counting the number of steps for each multiplier. Each step consists of one or more full-add operations. All full-add operations that run in the same step must be executed in parallel. The number of full adders required by an n-bit multiplier is (n*(n-1)).

Figure 2: Architecture of Ripple Carry Array Multiplier

The number of steps needed by the 4-bit Ripple Carry Array multiplier to perform the multiplication process is 8 (=3*n-4). The required registers are Q (4-bit) to hold the multiplicand, M (4-bit) to hold the multiplier, S (4-bit), which is the sum register, and C (4-bit), which is the carry register. The “,” symbol means a bit-wise AND operation. The inputs to the FA (Full-Adder) in row 1 and column 4 are the third bit stored in the carry register from the previous step and the result of the bit-wise AND operation between the fourth bit in the multiplicand register and the second bit stored in the multiplier register. The adders that are represented by the same color in Figure 2 must be executed in parallel. As mentioned, the number of steps needed is 8. In step 1 the HA colored in violet is executed. In step 2 the FA colored in dark blue is executed. In step 3 the adders colored in green are executed in parallel. In step 4 the adders colored in red are executed in parallel. In step 5 the adders colored in yellow are executed in parallel. In step 6 the FAs colored in pink are executed in parallel. In step 7 the FA colored in blue is executed. In step 8 the FA colored in orange is executed.

The steps involved in using a Ripple Carry Array multiplier of length 4 are emulated in Table 2 below. In this example the multiplicand = $0010_2$ (decimal 2) and the multiplier = $1001_2$ (decimal 9).

Table 2: Emulation of 4-bit Ripple Carry Multiplier

Step            Operation

Initialization  M=1001, Q=0010, S=0000, C=0000
                P[0] = M[0]&Q[0] = 0
Step 1          ADD(S[0], C[0], Q[0]&M[1], Q[1]&M[0], 0)
                S[0] = 0⊕0⊕1 = 1, C[0] = 0, P[1] = 1
Step 2          ADD(S[1], C[1], Q[2]&M[0], Q[1]&M[1], C[0])
                S[1] = 0⊕0⊕0 = 0, C[1] = 0
Step 3          ADD(S[2], C[2], Q[3]&M[0], Q[2]&M[1], C[1])
                S[2] = 0⊕0⊕0 = 0, C[2] = 0
                ADD(S[0], C[0], Q[0]&M[2], S[1], 0)
                S[0] = 0⊕0⊕0 = 0, C[0] = 0, P[2] = S[0] = 0
Step 4          ADD(S[3], C[3], 0, Q[3]&M[1], C[2])
                S[3] = 0⊕0⊕0 = 0, C[3] = 0
                ADD(S[1], C[1], Q[1]&M[2], S[2], C[0])
                S[1] = 0⊕0⊕0 = 0, C[1] = 0
Step 5          ADD(S[2], C[2], Q[2]&M[2], S[3], C[1])
                S[2] = 0⊕0⊕0 = 0, C[2] = 0
                ADD(S[0], C[0], Q[0]&M[3], S[1], 0)
                S[0] = 0⊕0⊕0 = 0, C[0] = 0, P[3] = S[0] = 0
Step 6          ADD(S[3], C[3], Q[3]&M[2], C[2], C[3])
                S[3] = 0⊕0⊕0 = 0, C[3] = 0
                ADD(S[1], C[1], Q[1]&M[3], S[2], C[0])
                S[1] = 1⊕0⊕0 = 1, C[1] = 0
Step 7          ADD(S[2], C[2], Q[2]&M[3], C[1], S[3])
                S[2] = 0⊕0⊕0 = 0, C[2] = 0
Step 8          ADD(S[3], C[3], Q[3]&M[3], C[2], C[3])
                S[3] = 0⊕0⊕0 = 0, C[3] = 0
                P[4] = S[1] = 1, P[5] = S[2] = 0
                P[6] = S[3] = 0, P[7] = C[3] = 0

So the result is 00010010 (decimal 18).

2.3 Carry Save Array Multiplier

The architecture of the carry save array multiplier is shown in Figure 3. It has a regular structure, and the number of steps required by an n-bit multiplier is (2n-2). This was inferred during the analysis of the multiplier by taking the block diagram of a 4-bit multiplier, extending it to multipliers of larger lengths, and counting the number of steps for each multiplier. Each step consists of one or more full-add operations. All full-add operations that run in the same step must be executed in parallel. The number of full adders required by an n-bit multiplier is (n*(n-1)).

The number of steps needed by a 4-bit carry save array multiplier to perform the multiplication process is 6 (=2*n-2). The required registers are Q (4-bit) to hold the multiplicand, M (4-bit) to hold the multiplier, S (3-bit), which is the sum register, and C (3-bit), which is the carry register. The “,” symbol means a bit-wise AND operation. The inputs to the HA (Half-Adder) in row 1 and column 3 are the result of the bit-wise AND operation between the fourth bit in the multiplicand register and the first bit stored in the multiplier register, and the result of the bit-wise AND operation between the third bit in the multiplicand and the second bit in the multiplier. The adders that are represented by the same color in Figure 3 must be executed in parallel. As mentioned, the number of steps needed is 6. In step 1 the HAs colored in dark are executed simultaneously. In step 2 the FAs in the next row are executed in parallel. In step 3 the third-row FAs are executed simultaneously. In step 4 the HA colored in light in the last row is executed. In step 5 the FA in the middle of the last row is executed. In step 6 the last FA is executed.

Figure 3: Architecture of Carry Save Array Multiplier

An example that emulates the Carry Save Array multiplier is given in Table 3 below. In this example the multiplicand = $0010_2$ (decimal 2) and the multiplier = $1001_2$ (decimal 9).

Table 3: Emulation of 4-bit Carry Save Array Multiplier

Step            Operation
Initialization  M=1001, Q=0010, S=0000, C=0000
                P[0] = M[0]&Q[0] = 0
Step 1          ADD(S[0], C[0], Q[1]&M[0], Q[0]&M[1], 0)
                S[0] = 0⊕0⊕1 = 1, C[0] = 0
                ADD(S[1], C[1], Q[2]&M[0], Q[1]&M[1], 0)
                S[1] = 0⊕0⊕0 = 0, C[1] = 0

                ADD(S[2], C[2], Q[3]&M[0], Q[2]&M[1], 0)
                S[2] = 0⊕0⊕0 = 0, C[2] = 0
                P[1] = S[0] = 1
Step 2          ADD(S[0], C[0], Q[0]&M[2], C[0], S[1])
                S[0] = 0⊕0⊕0 = 0, C[0] = 0
                ADD(S[1], C[1], Q[1]&M[2], C[1], S[2])
                S[1] = 0⊕0⊕0 = 0, C[1] = 0
                ADD(S[2], C[2], Q[2]&M[2], C[2], Q[3]&M[1])
                S[2] = 0⊕0⊕0 = 0, C[2] = 0
                P[2] = S[0] = 0
Step 3          ADD(S[0], C[0], Q[0]&M[3], C[0], S[1])
                S[0] = 0⊕0⊕0 = 0, C[0] = 0
                ADD(S[1], C[1], Q[1]&M[3], C[1], S[3])
                S[1] = 1⊕0⊕0 = 1, C[1] = 0
                ADD(S[2], C[2], Q[2]&M[3], C[2], Q[3]&M[2])
                S[2] = 0⊕0⊕0 = 0, C[2] = 0, P[3] = S[0] = 0
Step 4          ADD(S[0], C[0], 0, C[0], S[1])
                S[0] = 0⊕0⊕1 = 1, C[0] = 0
                P[4] = S[0] = 1
Step 5          ADD(S[1], C[1], S[0], C[0], C[1])
                S[1] = 0⊕0⊕0 = 0, C[1] = 0
                P[5] = S[1] = 0
Step 6          ADD(S[2], C[2], Q[3]&M[3], C[1], C[2])
                S[2] = 0⊕0⊕0 = 0, C[2] = 0
                P[6] = S[2] = 0, P[7] = C[2] = 0

So the result is 00010010 (decimal 18).

2.4 Wallace Tree Multiplier

The Wallace Tree multiplier shown in Figure 4 has an irregular structure [13]. Therefore, the design of the hardware circuitry needs to be modified for each multiplier of a specific length. The number of carry save adders needed by an n-bit Wallace tree multiplier is proportional to the length of the multiplier (it equals n-1 if the multiplier has a length of 3*k where k=1, 2, 3, 4…). A Wallace tree multiplier has an optimal speed, as discussed in [14]. For an n-bit Wallace tree multiplier, the number of steps needed is $\lceil \log_{3/2}(n/2) \rceil + 1$ [15]. Each step except the last consists of one or more carry-save-add operations. All carry save adders that execute in the same step must be performed in parallel. The last step requires a parallel add operation.

A 4-bit Wallace Tree multiplier requires $\lceil \log_{3/2}(n/2) \rceil = 2$ steps [14] plus one step to perform the parallel addition operation. Each P results from applying the logical AND operation between the multiplicand register and a bit from the multiplier register; for example, P3 results from ANDing the multiplicand register with the third bit in the multiplier register. In step 1, the carry save adder (CSA) in row 1 is executed. It accepts three inputs and produces two outputs. The outputs are the sum (8-bit) and the carry (9-bit). So the sum S = P2⊕P3⊕P4 and the carry C = (P2&P3) | (P3&P4) | (P4&P2). In step 2 the CSA in row 2 is executed; it accepts P1 and the two outputs produced by the CSA in row 1 as inputs in order to produce the sum (S) and the carry (C). In step 3 a parallel add operation is performed between the sum and the carry resulting from step 2 to get the final result (the carry bit is 0). Parallel add means that an unsigned addition is performed between the sum and the carry, where all bits of the sum and the carry contribute to the addition at the same time.

The architecture of an 8-bit Wallace Tree multiplier is shown in Figure 5. The number of steps needed by the 8-bit Wallace Tree multiplier equals $\lceil \log_{3/2}(n/2) \rceil = 4$ [14] plus 1 step to perform the parallel addition operation.

Figure 4: Architecture of 4-bit Wallace Tree Multiplier

An example that emulates a Wallace Tree multiplier of length 4 is given in Table 4 below. In this example the multiplicand = $0010_2$ (decimal 2) and the multiplier = $1001_2$ (decimal 9).

Table 4: Emulation of 4-bit Wallace Tree Multiplier

Step            Operation
Initialization  M=0000000010, Q=00001001,
                P1, P2, P3, P4 are initialized to zero
                P1 = Q & M[0] = 00000000
                Shift M to right so M=00000001
                P2 = Q & M[0] = 000001001
                Shift M to right so M=000000000
                Shift P2 to left so P2=000010010

                P3 = Q & M[0] = 00000000
                Shift M to right so M=000000000 and shift P3 to left
                P4 = Q & M[0] = 000000000 and shift P4 to left
                Shift M to right so M=00000000 and shift P4 to left
Step 1          Q = P2⊕P3⊕P4 = 000010010
                M = (P2 & P3) | (P3 & P4) | (P2 & P4) = 000000000
                Shift M to left so M=000000000
Step 2          P3 = Q⊕M⊕P1 = 00010010
                P4 = (Q & M) | (Q & P1) | (M & P1) = 000000000
                Shift P4 to left so P4=000000000
Step 3          ADD(P3, P4) = 00010010

So the result = 00010010 (18 decimal).

Figure 5: Architecture of 8-bit Wallace Tree Multiplier.

3. Multipliers Simulation Environment

A software environment that enables the simulation of the multiplication algorithms outlined in the previous section was implemented using the C++ programming language. The system has two main interface windows. In the first window, the user selects one of the five multipliers that have been implemented. Then the user is requested to enter the length of the multiplier. This must be an integer between 1 and 128 for the Hennessy, ripple carry array and carry save array multipliers. For the Wallace tree multiplier, the user needs to enter the number 4 or the number 8. Finally, the user is requested to enter the two operands of the multiplier in binary form within the specified length.

A second window, called the emulation window, appears on the screen directly after the user finishes entering the requested data in the main window. The emulation window emulates the multiplier selected by the user and provides the result of the multiplication process and the complexity of the multiplier (the number of operations required by the multiplier, with a specification of their types). The emulation of the multiplier reflects the actual execution of the multiplier based on its architecture. Once the emulation of the multiplier is complete, the user returns to the main window.

4. Handle-C FPGA Implementation

All the multipliers discussed in the previous sections were implemented on a Xilinx XC4000 FPGA development system. This enables the assessment and subsequent comparison of the area and time complexities of the multiplier architectures. The synthesis process was carried out using the Handle-C programming language and its supporting tools. All the C++ multiplier programs were translated to Handle-C code and subsequently compiled to generate the gate-level circuitry for FPGA implementation.

4.1 Area Complexity of the Multipliers

The physical area complexity of the multipliers was estimated using the Handle-C debugger utility, which gives the equivalent number of NAND gates that would be required following an optimisation of the synthesis.

Hennessy
The area complexity of a 4-bit Hennessy multiplier was assessed using the debugger of the Handle-C package. As shown in Figure 10, the area required by the 4-bit Hennessy multiplier is estimated to be the equivalent of 590 NAND gates.

Ripple Carry
The ripple carry array multiplier was implemented in Handle-C using both recursive and non-recursive programming techniques. The impact of each implementation on the required area was also assessed. The estimated number of NAND gates required for the synthesis of a 4-bit non-recursive ripple carry array multiplier is 559 gates. However, when the same multiplier is synthesized recursively, the estimated number of NAND gates required increases to 2009 gates. A 5-bit non-recursive ripple carry array

multiplier was also implemented with an equivalent gate count of 843 NAND gates.

Carry Save
The area complexity of the 4-bit Carry Save Array multiplier was also studied using recursive and non-recursive techniques. A large difference in the required number of gates was observed between the two implementations. The non-recursive technique requires the equivalent of 547 NAND gates to implement the 4-bit multiplier. However, using recursion increased the gate count to 1543 NAND gates. When a 5-bit non-recursive carry save array multiplier was implemented, the Handle-C debugger estimated the required number of gates after optimization to be the equivalent of 831 NAND gates.

Wallace Tree
A 4-bit Wallace Tree multiplier was implemented using only non-recursive programming techniques. As shown in Figure 11, the Handle-C debugger estimated the required area after optimization to be the equivalent of 1221 NAND gates.

4.2 Time Complexity of the Multipliers

The real time needed by a 4-bit multiplier, synthesised and already downloaded into the FPGA, to execute a single multiplication operation was measured using a logic analyzer. The two operands of the multiplier are altered within a loop (in the multiplier code) from $0000_2$ to $1111_2$. The output of the multiplier is directed to the appropriate FPGA development system output pins and subsequently captured by the logic analyzer. The FPGA clock is connected to the logic analyzer in order to act as an external clock. The FPGA clock frequency is set at 80 MHz, which gives a clock cycle of 12.5 ns. In order to compare the timing complexities appropriately, all the multipliers below were synthesised using non-recursive techniques.

Hennessy
The maximum number of clock cycles needed by the 4-bit Hennessy multiplier is 8 (the multiplier needs 4 steps and each step consists of one add operation and one shift operation), and this occurs when the multiplier is used to multiply M=15 by Q=15. Table 5 shows part of the results that were captured by the logic analyzer when the two operands of the multiplier were being altered from $0000_2$ to $1111_2$.

Table 5 shows the output of the multiplier at a given time, and the Timestamp column shows the time required for the operation. The number of FPGA clock cycles needed by the multiplier to produce the result $E1_{16}$, that is, the total time the multiplier takes to multiply M=15 by Q=15, is 9 FPGA clock cycles, which is the number of clock cycles needed by the multiplier to change its output from $C4_{16}$ to $E1_{16}$. So the maximum time needed by the 4-bit Hennessy multiplier to perform the multiplication process = 13+12.5+12+13+12+13+12.5+12+12.5 = 112.5 ns, but the theoretical time = 12.5*8 = 100 ns. The FPGA-implemented multiplier needs 1 additional clock cycle for interfacing with the logic analyzer. From Table 5, it can be seen that, depending on the numbers being multiplied, the multiplier needs a different number of clock cycles to produce different outputs, which is in line with the theoretical analysis. For example, the multiplier needs 6 clock cycles in order to produce the result of 1*1. The clock cycles needed by the multiplier to multiply 0*0 are highlighted in Table 5.

Table 5: Part of the logic analyzer stream for 4-bit Hennessy multiplier

Sample   Output   Timestamp
107      C4       13.000 ns
108      C4       12.500 ns
109      C4       12.000 ns
110      C4       13.000 ns
111      C4       12.000 ns
112      C4       13.000 ns
113      C4       12.500 ns
114      C4       12.000 ns
115      C4       12.500 ns
116      E1       12.500 ns
...      E1       12.500 ns
119      E1       12.500 ns
120      E1       13.000 ns
121      00       12.500 ns
122      00       12.500 ns
123      00       12.000 ns
124      00       13.000 ns
125      00       12.000 ns
126      00       12.500 ns
...      01       12.500 ns
131      01       12.500 ns

Ripple Carry
The number of clock cycles needed by the 4-bit ripple carry array multiplier is 8 (3*4 - 4). Table 6 shows part of the results captured by the logic analyzer when the two operands of the multiplier were being altered from $0000_2$ to $1111_2$. The number of FPGA clock cycles needed by the multiplier to produce any result is 9 (1 additional clock cycle is needed to interface the multiplier with the logic analyzer), which is the number of clock cycles needed by the multiplier to change its output from one value to another, as shown in Table 6. So the time needed by the 4-bit ripple carry array multiplier to perform the multiplication process = 12.5*9 = 112.5 ns, but the theoretical time = 12.5*8 = 100 ns. As explained earlier,

the difference is due to the need for 1 additional clock cycle to interface with the logic analyzer.

From Table 6, the multiplier needs the same number of clock cycles to produce the different outputs. For example, the multiplier needs 9 clock cycles in order to produce the result of 1*1, and it needs the same number of clock cycles to change its output from 01 to 04. This concurs with the theoretical analysis of the multiplier. The clock cycles needed to multiply 2*2 are highlighted in Table 6.

Table 6: Part of logic analyzer stream for 4-bit ripple carry array.

Sample   Output   Timestamp
94       00       12.500 ns
...      00       12.500 ns
102      00       12.500 ns
103      01       12.500 ns
...      01       12.500 ns
111      01       12.500 ns
112      04       12.500 ns
113      04       12.500 ns

Carry Save
The number of clock cycles needed by the 4-bit carry save array multiplier is 6 (2*4 - 2). Table 7 shows part of the results that were captured by the logic analyzer when the two operands of the multiplier were being altered from $0000_2$ to $1111_2$.

The number of FPGA clock cycles needed by the multiplier to produce any result is 7 (1 additional clock cycle is needed to interface the multiplier with the logic analyzer). This is demonstrated in Table 7 as the number of clock cycles needed by the multiplier to change its output from one value to another. So the time needed by the carry save array multiplier to perform the multiplication process = 12.5*7 = 87.5 ns, but the theoretical time = 12.5*6 = 75 ns. As in the case of the previous multiplier, the additional cycle is needed to interface the multiplier with the logic analyzer. From Table 7, the multiplier needs the same number of clock cycles to produce different outputs. For example, the multiplier needs 7 clock cycles in order to produce the result of 1*1, and it needs 7 clock cycles to change its output from C4 to E1. This is the same as established by the theoretical analysis. The clock cycles needed to multiply 15*15 are highlighted in Table 7.

Table 7: Part of logic analyzer stream for 4-bit carry save multiplier.

Sample   Output   Timestamp
19       C4       12.500 ns
...      C4       12.500 ns
25       C4       12.500 ns
26       E1       12.500 ns
...      E1       12.500 ns
32       E1       12.500 ns
33       00       12.500 ns
...      00       12.500 ns
39       00       12.500 ns
40       01       12.500 ns

Wallace Tree
The number of clock cycles needed by the 4-bit Wallace tree multiplier is 3 ($\lceil \log_{3/2}(4/2) \rceil + 1 = 3$). Table 8 shows part of the results captured by the logic analyzer as the two operands of the multiplier were being altered from $0000_2$ to $1111_2$.

As shown in Table 8, the number of FPGA clock cycles needed by the multiplier to produce any result is 6, as it changes its output from one value to another. The 3 additional clock cycles were needed to synchronize the multiplier with the logic analyzer. So the time needed by the 4-bit Wallace Tree multiplier to perform the multiplication process = 12.5*6 = 75 ns, which is twice the time predicted by the theoretical analysis. The difference between the required theoretical and actual clock cycles tends to reduce, in relative terms, as the multiplier length increases. The reason is that the number of synchronization clock cycles remains fixed independent of the multiplier length, while a longer multiplier requires a higher number of clock cycles. This was verified by implementing an 8-bit Wallace Tree multiplier.

From Table 8, the Wallace Tree multiplier needs the same number of clock cycles to produce different outputs. For example, the multiplier needs 6 clock cycles to produce the result of 1*1 and it needs 6 clock cycles to change its output from 04 to 09. This is in accordance with the theoretical analysis. The clock cycles needed to multiply 3*3 are highlighted in Table 8.

A summary of the area complexity in terms of the estimated equivalent number of NAND gates, as well as the theoretical and practical time complexities, is shown in Table 9. All the multipliers in the table below are 4 bits long and were implemented in Handle-C using non-recursive programming techniques.

Table 8: Part of logic analyzer stream for 4-bit Wallace tree.

Sample   Output   Timestamp
2        01       12.500 ns
...      01       12.500 ns
8        04       12.500 ns
...      04       12.500 ns
13       04       12.500 ns
14       09       12.500 ns
...      09       12.500 ns
19       09       12.500 ns
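The theoretical cycle counts quoted in Section 4.2 can be reproduced with a short Python sketch. It is an illustrative aid and not the authors' C++ or Handle-C code. The function names are hypothetical, the Hennessy count assumes one shift per step plus one add for every 1 bit of the multiplier (which matches the 8-cycle maximum and the data-dependent behaviour reported for Table 5), and the interface overheads (1 cycle, or 3 for the Wallace tree) are simply the values reported from the logic analyzer measurements.

```python
import math

CLOCK_NS = 12.5  # clock period at 80 MHz, as stated in Section 4.2


def hennessy_cycles(multiplier: int, n: int) -> int:
    """Assumed model: n shifts plus one add for every 1 bit in the n-bit multiplier."""
    ones = bin(multiplier & ((1 << n) - 1)).count("1")
    return n + ones


def ripple_carry_cycles(n: int) -> int:
    """Steps of the n-bit ripple carry array multiplier: 3n - 4."""
    return 3 * n - 4


def carry_save_cycles(n: int) -> int:
    """Steps of the n-bit carry save array multiplier: 2n - 2."""
    return 2 * n - 2


def wallace_cycles(n: int) -> int:
    """Carry-save levels, ceil(log_{3/2}(n/2)), plus one final parallel addition."""
    return math.ceil(math.log(n / 2, 1.5)) + 1


def time_ns(cycles: int) -> float:
    """Execution time in nanoseconds for a given number of clock cycles."""
    return cycles * CLOCK_NS


# Theoretical cycles for n = 4 (Hennessy worst case: multiplier = 15).
theoretical = {
    "Hennessy": hennessy_cycles(0b1111, 4),
    "Ripple Carry Array": ripple_carry_cycles(4),
    "Carry Save Array": carry_save_cycles(4),
    "Wallace Tree": wallace_cycles(4),
}

# Extra interface/synchronization cycles reported for the FPGA measurements.
overhead = {
    "Hennessy": 1,
    "Ripple Carry Array": 1,
    "Carry Save Array": 1,
    "Wallace Tree": 3,
}

measured = {name: cycles + overhead[name] for name, cycles in theoretical.items()}

for name, cycles in theoretical.items():
    print(f"{name}: {cycles} cycles ({time_ns(cycles)} ns) theoretical, "
          f"{measured[name]} cycles ({time_ns(measured[name])} ns) measured")
```

For n = 4 the sketch gives 8, 8, 6 and 3 theoretical cycles and 9, 9, 7 and 6 measured cycles, which at 12.5 ns per cycle correspond to 112.5 ns, 112.5 ns, 87.5 ns and 75 ns.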

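The three steps counted for the 4-bit Wallace tree multiplier (two carry-save levels followed by one parallel addition, as described in Section 2.4) can also be sketched in Python. The names `csa` and `wallace_4bit` are hypothetical, and the partial products are pre-shifted to their bit weights, which is an assumption about how the shifts listed in Table 4 are applied.

```python
from typing import Tuple


def csa(a: int, b: int, c: int) -> Tuple[int, int]:
    """Carry-save adder: bitwise sum, and majority carry weighted one bit higher."""
    s = a ^ b ^ c
    carry = ((a & b) | (b & c) | (a & c)) << 1
    return s, carry


def wallace_4bit(multiplicand: int, multiplier: int) -> Tuple[int, int]:
    """Return (product, steps) for two 4-bit unsigned operands."""
    if not (0 <= multiplicand < 16 and 0 <= multiplier < 16):
        raise ValueError("operands must fit in 4 bits")

    # Partial products P1..P4: the multiplicand ANDed with one multiplier bit,
    # shifted left to that bit's weight.
    p1, p2, p3, p4 = (
        (multiplicand if (multiplier >> i) & 1 else 0) << i for i in range(4)
    )

    steps = 0
    s, c = csa(p2, p3, p4)  # step 1: first carry-save level
    steps += 1
    s, c = csa(p1, s, c)    # step 2: second carry-save level
    steps += 1
    product = s + c         # step 3: final parallel addition
    steps += 1
    return product, steps


print(wallace_4bit(0b0010, 0b1001))
```

Each carry-save level preserves the total, because a + b + c equals the bitwise sum plus twice the majority carry, so the final addition yields the ordinary product for every pair of 4-bit operands.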
Table 9: Summary of Complexities

Complexity                        Hennessy   Ripple Carry Array   Carry Save Array   Wallace Tree
Area in NAND gates                590        559                  547                1221
Theoretical time (clock cycles)   8          8                    6                  3
Practical time (clock cycles)     9          9                    7                  6

5. Conclusions and Future Work

The paper discussed the architecture and operation of four major integer multipliers, namely Hennessy, ripple carry array, carry save array and Wallace tree. The paper detailed the area and time complexities of each multiplier and their implications for its performance. The study shows that the Wallace tree is the best in terms of time complexity, but it requires the highest number of gates, which means it will consume a larger chip area than the other multipliers. The carry-save multiplier requires the smallest number of gates and hence a smaller chip area than the other multiplier architectures. The time complexity of the carry-save array multiplier is better than that of both the Hennessy and the ripple-carry multipliers, but it is twice that of the Wallace tree from the theoretical analysis viewpoint. This indicates that the carry-save array multiplier is the best compromise among the four architectures in terms of area and time complexities.

The comparison of recursive and non-recursive techniques for implementing the multipliers indicates that the hardware synthesis language Handle-C cannot cope well with recursion, and this is reflected in the very high gate count that was generated for the recursive synthesis of both the ripple-carry and carry-save multipliers. This may seem a shortcoming of the language, but it requires further investigation in the future.

The simulation environment described in the paper gives the user the ability to understand the multiplication process and assess the various algorithms. It also allows the user to emulate the multiplication of particular numbers and trace the execution process. Some further enhancements will be introduced to the simulation environment as part of the future work. These will include real-time measurements, graphical visualization of the multiplication process and the introduction of additional multiplication algorithms.

A more in-depth assessment of the physical complexities, both area and time, of the multipliers reported in this study, as well as of other architectures, will be carried out in the future. This will be done for both FPGA and full-custom implementations.

References

[1] S.D. Pezaris, “A 40 ns 17-bit Array Multiplier”, IEEE Trans. Computers, vol. 20, no. 40, pp. 442-447, April 1971.
[2] K.Z. Pekmestzi, “Multiplexer-Based Array Multipliers”, IEEE Trans. Computers, vol. 48, no. 1, pp. 15-23, Jan. 1999.
[3] J.C. Hoffman and R. Kitai, “Parallel Multiplier Circuit”, Electronics Letters, vol. 4, May 1968.
[4] K. Bickerstaff, M.J. Schulte, and E.E. Swartzlander Jr., “Parallel Reduced Area Multipliers”, J. VLSI Signal Processing, vol. 9, pp. 181-192, 1995.
[5] P.J. Song and G.D. Micheli, “Circuit and Architecture Trade-Offs for High-Speed Multiplication”, IEEE J. Solid-State Circuits, vol. 26, no. 9, pp. 1184-1198, Sept. 1991.
[6] C.S. Wallace, “Suggestion for a Fast Multiplier”, IEEE Trans. Electronic Computers, vol. 3, pp. 14-17, 1964.
[7] I. Koren, Computer Arithmetic and Algorithms, Brookside Court Publishers, 1998.
[8] M.J. Schulte, P.I. Balzola, A. Akkas, and R.W. Brocato, “Integer Multiplication with Overflow Detection or Saturation”, IEEE Trans. Computers, vol. 49, no. 7, July 2000.
[9] J. Rose, A. El-Gamal and A. Sangiovanni-Vincentelli, “Architecture of Field-Programmable Gate Arrays”, Proceedings of the IEEE, vol. 81, no. 7, pp. 1013-1029, July 1993.
[10] Wayne Wolf, FPGA-Based System Design, Prentice Hall, 2004.
[11] Celoxica Limited, Handle-C Language Reference Manual, 2003.
[12] David A. Patterson and John L. Hennessy, Computer Organization & Design: The Hardware/Software Interface, 2nd edition, Morgan Kaufmann, 1998.
[13] http://www.andraka.com.
[14] http://www.vlsi.ee.upatras.gr/~sklavos/Papers01/PATMOS01_FaultSecure.pdf
[15] http://www.coecs.ou.edu/ldebrunn/www/teaching/arithmetic/L11.pdf

