Abstract: Decimal arithmetic has received considerable attention recently because of its suitability for many financial and
commercial applications. In particular, numerous algorithms
have recently been proposed for decimal multiplication. A major
approach shaped by these proposals performs the decimal
digit-by-digit multiplication in binary, converts each binary
partial product back to decimal, and then adds the decimal
partial products as appropriate to form the final product in
decimal. With this approach, the efficiency of the binary-to-BCD
partial product conversion is critical to the efficiency of the
overall multiplication process. A recently proposed algorithm for
this conversion splits the binary partial product into two parts
(i.e., two groups of bits) and computes the contributions of the
two parts to the partial BCD result in parallel. This paper
proposes two new algorithms (Three-Four split and Four-Three
split) based on this principle. We present architectures that
implement these algorithms and compare them to existing designs.
The synthesis results show that the Three-Four split algorithm runs
15% faster and occupies 26.1% less area than the best-performing
equivalent circuit found in the literature. Furthermore, the
Four-Three split algorithm occupies 37.5% less area than the
state-of-the-art equivalent circuit.
I. INTRODUCTION
Decimal arithmetic is natural for many applications, especially those where the input data is already in decimal format,
such as financial and scientific applications. Therefore, several
algorithms have been proposed to support the basic decimal
arithmetic operations (including multiplication) in hardware.
Decimal multiplication in particular has been tackled in
several ways. One approach performs the multiplication
directly in decimal. Another converts the operands to binary,
performs the multiplication in binary, and then converts the
result back to decimal. A third approach performs decimal
digit-by-digit multiplication in binary and then converts each
resulting binary partial product to decimal. The decimal
partial products are then added as appropriate to form
the final decimal product. This last approach is illustrated in
Figure 1, where Pij are the binary partial products that are
converted to the decimal partial products Dij. In this
technique, the performance of the binary-to-decimal conversion is
critical to the overall multiplication performance.
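The third approach can be sketched in software as follows. This is a minimal behavioural illustration, not the hardware described in this paper: the function names `bin2bcd` and `decimal_multiply` are hypothetical, the conversion is modelled with `divmod`, and the decimal partial products are accumulated with ordinary integer arithmetic where a real design would use decimal adders.

```python
def bin2bcd(p):
    """Convert a binary partial product (0..81) to its BCD digits (DH, DL)."""
    if not 0 <= p <= 81:
        raise ValueError("a product of two decimal digits lies in 0..81")
    return divmod(p, 10)


def decimal_multiply(x_digits, y_digits):
    """Multiply two decimal numbers given as digit lists, least significant digit first."""
    total = 0
    for i, xd in enumerate(x_digits):
        for j, yd in enumerate(y_digits):
            p = xd * yd                      # binary partial product Pij
            dh, dl = bin2bcd(p)              # decimal partial product Dij
            total += (10 * dh + dl) * 10 ** (i + j)
    return total


print(decimal_multiply([7, 4], [9, 3]))      # 47 * 39 = 1833
```

Every call to `bin2bcd` sits on the critical path of one partial product, which is why the conversion speed matters so much in this scheme.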
Fig. 1. [Figure: operand digits X1 X0 and Y1 Y0; binary partial products P00, P01, P10, P11; BIN2BCD conversion; decimal partial products D00, D01, D10, D11.]
Most Three Bits    Contribution to DH    Contribution to DL
000                0000                  0000
010                0011                  0010
101                1000                  0000
TABLE I
EXAMPLES OF THE CONTRIBUTION OF THE THREE MOST SIGNIFICANT BITS TO DH AND DL AS COMPUTED BY [2]
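The rows of Table I follow from the weight of the three most significant bits, which is 16. The short sketch below reproduces them arithmetically; the function name `msb3_contribution` is illustrative and the code models the values only, not the gate-level generator of [2].

```python
def msb3_contribution(bits):
    """Return the (DH, DL) bit strings contributed by A6 A5 A4 (value 0..5)."""
    if not 0 <= bits <= 0b101:
        raise ValueError("A6 A5 A4 cannot exceed 101 for a product of at most 81")
    dh, dl = divmod(16 * bits, 10)           # the three bits carry a weight of 16
    return format(dh, "04b"), format(dl, "04b")


for bits in (0b000, 0b010, 0b101):
    print(format(bits, "03b"), *msb3_contribution(bits))
```

For example, 010 represents 32, so it contributes 3 (0011) to DH and 2 (0010) to DL.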
Fig. 2. [Figure: the A6 A5 A4 contribution generator outputs X7 X6 X5 X4 and X3 X2 X1; the A3 A2 A1 A0 contribution generator outputs Z4 and Z3 Z2 Z1; the DL Generator produces the 4-bit DL and the carry C; the DH Generator produces the 4-bit DH.]

Least Four Bits    Contribution to DH    Contribution to DL
0000               0000                  0000
0011               0000                  0011
0111               0000                  0111
1000               0000                  1000
1111               0001                  0101
TABLE II
EXAMPLES OF THE CONTRIBUTION OF THE FOUR LEAST SIGNIFICANT BITS TO DH AND DL

Fig. 3. DL Generator [Figure: two cascaded 3-bit adders with Cin tied to 0; carry output C.]
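The data flow of Figures 2 and 3 can be modelled behaviourally as shown below. This is a sketch under stated assumptions: the contribution generators are modelled with `divmod` rather than gates, the correction constant of 3 (that is, 6 once the least significant bit is restored) is the usual BCD correction and is my reading of the second adder, and the function name `three_four_split` is illustrative.

```python
def three_four_split(a):
    """Behavioural model: 7-bit binary partial product a (0..81) -> (DH, DL)."""
    if not 0 <= a <= 81:
        raise ValueError("partial product must lie in 0..81")
    hi, lo = a >> 4, a & 0xF                 # A6 A5 A4 and A3 A2 A1 A0

    # Contribution generators, modelled arithmetically.
    x_dh, x_dl = divmod(16 * hi, 10)         # X7..X4 and X3 X2 X1 (LSB is 0)
    z4, z_dl = divmod(lo, 10)                # Z4 and Z3 Z2 Z1 (LSB equals A0)

    # DL generator: add the contributions without their least significant bit.
    s = (x_dl >> 1) + (z_dl >> 1)            # first 3-bit adder, range 0..8
    c = 1 if s >= 5 else 0                   # decimal carry into DH
    if c:
        s = (s + 3) & 0b111                  # second adder: correction by 3
    dl = (s << 1) | (a & 1)                  # A0 passes through as DL[0]

    # DH generator: add Z4 and the carry to the most significant contribution.
    dh = x_dh + z4 + c
    return dh, dl


print(three_four_split(79))                  # (7, 9)
```

Dropping the least significant bit before the addition is what allows 3-bit adders to be used: the upper contribution is always even, and the parity of the lower contribution is simply A0.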
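Fig. 4. DH Generator [Figure: a 3-bit adder with inputs X6 X5 X4 and 0 0 Z4, Cin = C and output Cout; X7 is a separate input.]
Fig. 5. Optimized DL Generator [Figure: inputs A0, X3 X2 X1 and Z3 Z2 Z1; blocks Optimized Addition Stage I (outputs S3 S2 S1), Carry Generator (output C) and Optimized Correction; outputs DL[3] DL[2] DL[1] DL[0].]
Fig. 6. Optimized DH Generator [Figure: inputs X7, X6, X5, X4, Z4 and C; block Optimized Addition Stage II.]
$DH[3] = X_7$
$DH[2] = X_5 Z_4 + X_5 C + X_6$
$DH[1] = \overline{X_5} X_4 C + X_5 \overline{X_4} + Z_4 C + X_5 \overline{Z_4}\,\overline{C} + \overline{X_5} X_4 Z_4$
$DH[0] = \overline{X_4} Z_4 \overline{C} + \overline{X_4}\,\overline{Z_4} C + X_4 Z_4 C + X_4 \overline{Z_4}\,\overline{C}$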
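The DH[3] to DH[0] equations can be checked exhaustively against plain division by 10. The sketch below is one reading of those equations: the complement placement written in the code is an assumption, the carry C is modelled behaviourally as the decimal carry of the DL addition, and the helper names are illustrative.

```python
def inv(b):
    """Complement of a single bit."""
    return 1 - b


def dh_equations(x7, x6, x5, x4, z4, c):
    """Optimized DH generator of the Three-Four split, as sum-of-products logic."""
    dh3 = x7
    dh2 = (x5 & z4) | (x5 & c) | x6
    dh1 = ((inv(x5) & x4 & c) | (x5 & inv(x4)) | (z4 & c)
           | (x5 & inv(z4) & inv(c)) | (inv(x5) & x4 & z4))
    dh0 = ((inv(x4) & z4 & inv(c)) | (inv(x4) & inv(z4) & c)
           | (x4 & z4 & c) | (x4 & inv(z4) & inv(c)))
    return (dh3 << 3) | (dh2 << 2) | (dh1 << 1) | dh0


def check_dh_equations():
    """Compare the equations with a // 10 for every possible partial product."""
    for a in range(82):
        hi, lo = a >> 4, a & 0xF
        x_dh, x_dl = divmod(16 * hi, 10)     # contribution of A6 A5 A4
        z4, z_dl = divmod(lo, 10)            # contribution of A3 A2 A1 A0
        c = 1 if (x_dl >> 1) + (z_dl >> 1) >= 5 else 0
        x7, x6, x5, x4 = [(x_dh >> k) & 1 for k in (3, 2, 1, 0)]
        if dh_equations(x7, x6, x5, x4, z4, c) != a // 10:
            return False
    return True


print(check_dh_equations())
```

The two-literal terms in DH[1] are valid only because some input combinations never occur. For instance, X5 = 1 with X4 = 0 arises only for a contribution of 64, where DH is always 6 or 7.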
Most Four Bits    Contribution to DH    Contribution to DL
0000              0000                  0000
0011              0010                  0100
0111              0101                  0110
1000              0110                  0100
1010              1000                  0000
TABLE III
EXAMPLES OF THE CONTRIBUTION OF THE FOUR MOST SIGNIFICANT BITS TO DH AND DL
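Table III can be regenerated from the weight of the four most significant bits, which is 8. The sketch below (with the illustrative name `msb4_contribution`) builds every valid row and also confirms that the contribution to DL is always even.

```python
def msb4_contribution(bits):
    """Return the (DH, DL) bit strings contributed by A6 A5 A4 A3 (value 0..10)."""
    if not 0 <= bits <= 0b1010:
        raise ValueError("A6 A5 A4 A3 cannot exceed 1010 for a product of at most 81")
    dh, dl = divmod(8 * bits, 10)            # the four bits carry a weight of 8
    return format(dh, "04b"), format(dl, "04b")


rows = {format(m, "04b"): msb4_contribution(m) for m in range(0b1010 + 1)}
for key, (dh, dl) in rows.items():
    print(key, dh, dl)
```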
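Fig. 7. [Figure: the A6 A5 A4 A3 contribution generator outputs Y7 Y6 Y5 Y4 and Y3 Y2 Y1; the DL Generator combines Y3 Y2 Y1 with A2 A1 A0 to produce the 4-bit DL; the DH Generator produces the 4-bit DH.]
$DL[3] = \overline{S_3} S_2 C + \overline{S_3} S_1 C + S_3 \overline{C} + S_3 \overline{S_2}\,\overline{S_1}$
$DL[2] = \overline{S_2}\,\overline{S_1} C + S_2 \overline{C} + S_2 S_1$
$DL[1] = S_1 \overline{C} + \overline{S_1} C$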
C = Z 3 X3 + Z 2 X3 + Z 2 X2 + Z 3 X2 + Z 3 X1
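The DL[3], DL[2] and DL[1] correction equations can be checked against their intended behaviour, which I take to be adding 3 (modulo 8) to S3 S2 S1 whenever C = 1. The sketch below assumes the complement placement shown in its code; the function names are illustrative.

```python
def inv(b):
    """Complement of a single bit."""
    return 1 - b


def correction_equations(s3, s2, s1, c):
    """Optimized correction block: returns DL[3] DL[2] DL[1] as a 3-bit value."""
    dl3 = ((inv(s3) & s2 & c) | (inv(s3) & s1 & c)
           | (s3 & inv(c)) | (s3 & inv(s2) & inv(s1)))
    dl2 = (inv(s2) & inv(s1) & c) | (s2 & inv(c)) | (s2 & s1)
    dl1 = (s1 & inv(c)) | (inv(s1) & c)
    return (dl3 << 2) | (dl2 << 1) | dl1


def check_correction():
    """Compare the equations with (S + 3*C) mod 8 for all 16 input combinations."""
    for s in range(8):
        for c in (0, 1):
            s3, s2, s1 = (s >> 2) & 1, (s >> 1) & 1, s & 1
            if correction_equations(s3, s2, s1, c) != (s + 3 * c) & 0b111:
                return False
    return True


print(check_correction())
```

Adding 3 here corresponds to the usual BCD correction by 6, because the least significant bit of DL is handled separately.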
Similar observations also hold for the DH Generator block
shown in Figure 4. Two of the inputs of the 3-bit adder are
always zero and the Cout output is unused. Replacing this adder
with a customized version, in which the unneeded inputs and
outputs are removed and only the input combinations that can
actually occur are considered, results in the circuit shown in
Figure 6. The output logic equations of this circuit are the
DH[3] to DH[0] equations given above.
The A6 A5 A4 A3 contribution generator of the Four-Three split
architecture has the following output logic equations:
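$Y_7 = A_6 A_4$
$Y_6 = A_5 A_4 \overline{A_3} + A_6 \overline{A_4}\,\overline{A_3} + A_5 A_3 + A_6 A_3$
$Y_5 = \overline{A_5} A_4 A_3 + A_6 \overline{A_4}\,\overline{A_3} + A_5 \overline{A_4}\,\overline{A_3} + A_6 A_3$
$Y_4 = \overline{A_6}\,\overline{A_5} A_4 \overline{A_3} + A_5 A_4 A_3 + A_5 \overline{A_4}\,\overline{A_3} + A_6 A_3$
$Y_3 = \overline{A_6}\,\overline{A_5}\,\overline{A_4} A_3 + A_5 A_4 \overline{A_3}$
$Y_2 = \overline{A_6}\,\overline{A_5} A_4 \overline{A_3} + \overline{A_5} A_4 A_3 + A_6 \overline{A_4}\,\overline{A_3} + A_5 A_4 A_3$
$Y_1 = \overline{A_6}\,\overline{A_5} A_4 \overline{A_3} + A_5 A_4 A_3 + A_5 \overline{A_4}\,\overline{A_3} + A_6 A_3$

Architecture                              Delay (ns)    Area (µm^2)    Dynamic Power (µW)
Four-Three split                          1.21          1095           500.49
Three-Four split                          1.13          1348           689.24
Architecture of [2]                       1.33          1754           832.1
Architecture of [6] (as corrected in [2]) 1.97          2241           1112.6
TABLE IV
PERFORMANCE COMPARISON OF VARIOUS ARCHITECTURES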
Note that the contribution of the four most significant bits
to DL always has a least significant bit of zero. Therefore,
these four bits contribute only three bits, with weights of
2, 4, and 8. The overall architecture of the Four-Three split
binary-to-BCD converter is shown in Figure 7. The DL Generator
circuit is similar to the circuit shown in Figure 3, with inputs
Z3 Z2 Z1 replaced by 0 A2 A1 and X3 X2 X1 replaced by the three
bits coming out of the A6 A5 A4 A3 contribution generator
(Y3 Y2 Y1). Observations similar to those made for Figure 5
also apply. For example, the top 3-bit adder now has Cin
always zero and one of its input bits always zero (since we
replaced Z3 Z2 Z1 with 0 A2 A1). The optimized DL generator
circuit consists of three blocks similar to those of the
optimized DL generator of Figure 5. These blocks have the
following output logic equations:
Optimized addition stage I:
S3 = Y3 + Y2 Y1 A1 + Y2 A2
S2 = Y2 A2 A1 + Y 2 A2 + Y2 Y 1 A2 + Y 2 Y1 A1 + Y1 A2 A1
$S_1 = Y_1 \overline{A_1} + \overline{Y_1} A_1$
Optimized correction: same as the optimized correction of
the Three-Four split architecture of Figure 5.
Carry generator:
C = Y2 Y1 A2 + Y3 A1 + Y3 A2 + Y2 A2 A1
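The carry generator equation can be checked against its intended meaning: C should be 1 exactly when Y3 Y2 Y1 plus 0 A2 A1 reaches 5, that is, when the DL digit would reach 10. The sketch below also checks S1 read as the exclusive OR of Y1 and A1, which is an assumed complement placement; Y3 Y2 Y1 is modelled arithmetically and the function names are illustrative.

```python
def carry_generator(y3, y2, y1, a2, a1):
    """C = Y2 Y1 A2 + Y3 A1 + Y3 A2 + Y2 A2 A1."""
    return (y2 & y1 & a2) | (y3 & a1) | (y3 & a2) | (y2 & a2 & a1)


def stage_one_lsb(y1, a1):
    """S1 read as Y1 XOR A1 (assumed complement placement)."""
    return (y1 & (1 - a1)) | ((1 - y1) & a1)


def check_four_three_carry():
    """Compare C and S1 with the arithmetic sum for every partial product 0..81."""
    for a in range(82):
        y = ((8 * (a >> 3)) % 10) >> 1       # Y3 Y2 Y1, modelled arithmetically
        half = (a >> 1) & 0b11               # A2 A1
        y3, y2, y1 = (y >> 2) & 1, (y >> 1) & 1, y & 1
        a2, a1 = (half >> 1) & 1, half & 1
        total = y + half                     # 3-bit sum, range 0..7
        if carry_generator(y3, y2, y1, a2, a1) != int(total >= 5):
            return False
        if stage_one_lsb(y1, a1) != total & 1:
            return False
    return True


print(check_four_three_carry())
```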
The DH Generator block is also similar to the circuit of
Figure 4, with inputs 00Z4 replaced by all zeros. As before, a
customized circuit similar to that of Figure 6 can be built.
The output logic equations for this circuit are as follows:
$DH[0] = Y_4 \overline{C} + \overline{Y_4} C$
$DH[1] = C Y_4 + Y_5$
$DH[2] = Y_6$
$DH[3] = Y_7$
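Putting the pieces together, the Four-Three split can be modelled end to end and compared with division by 10. The sketch below is one reading of the design, not a definitive implementation: the complement placement in the Y and DH equations is an assumption, the first addition stage and the correction are modelled arithmetically rather than with their gate equations, and all function names are illustrative.

```python
def inv(b):
    """Complement of a single bit."""
    return 1 - b


def contribution_generator(a6, a5, a4, a3):
    """Outputs Y7..Y1 for the four most significant bits (weight 8)."""
    y7 = a6 & a4
    y6 = (a5 & a4 & inv(a3)) | (a6 & inv(a4) & inv(a3)) | (a5 & a3) | (a6 & a3)
    y5 = ((inv(a5) & a4 & a3) | (a6 & inv(a4) & inv(a3))
          | (a5 & inv(a4) & inv(a3)) | (a6 & a3))
    y4 = ((inv(a6) & inv(a5) & a4 & inv(a3)) | (a5 & a4 & a3)
          | (a5 & inv(a4) & inv(a3)) | (a6 & a3))
    y3 = (inv(a6) & inv(a5) & inv(a4) & a3) | (a5 & a4 & inv(a3))
    y2 = ((inv(a6) & inv(a5) & a4 & inv(a3)) | (inv(a5) & a4 & a3)
          | (a6 & inv(a4) & inv(a3)) | (a5 & a4 & a3))
    y1 = y4                                  # Y1 has the same expression as Y4
    return y7, y6, y5, y4, y3, y2, y1


def four_three_split(a):
    """7-bit binary partial product a (0..81) -> (DH, DL)."""
    if not 0 <= a <= 81:
        raise ValueError("partial product must lie in 0..81")
    a6, a5, a4, a3, a2, a1, a0 = [(a >> k) & 1 for k in (6, 5, 4, 3, 2, 1, 0)]
    y7, y6, y5, y4, y3, y2, y1 = contribution_generator(a6, a5, a4, a3)

    # DL generator: stage I sum (arithmetic model), carry equation, correction.
    s = ((y3 << 2) | (y2 << 1) | y1) + ((a2 << 1) | a1)
    c = (y2 & y1 & a2) | (y3 & a1) | (y3 & a2) | (y2 & a2 & a1)
    if c:
        s = (s + 3) & 0b111
    dl = (s << 1) | a0                       # A0 passes through as DL[0]

    # Optimized DH generator.
    dh0 = (y4 & inv(c)) | (inv(y4) & c)
    dh1 = (c & y4) | y5
    dh = (y7 << 3) | (y6 << 2) | (dh1 << 1) | dh0
    return dh, dl


print(four_three_split(81))                  # (8, 1)
```

DH[2] and DH[3] need no logic beyond Y6 and Y7 because the carry never propagates past DH[1]: Y5 = Y4 = 1 occurs only for contributions of 32 and 72, where the DL sum stays below 10.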
IV. RESULTS
A. Performance Evaluation
In this section, we compare four different architectures for
binary partial product to decimal partial product conversion.
These architectures are: (i) our Three-Four split, (ii) our
Four-Three split, (iii) the architecture proposed in [2] with
the C2 corrected as noted before, and (iv) the version of the
architecture of [6] that is corrected in [2]. We describe all
architectures using Verilog HDL data flow modeling.
AND FUTURE WORK
REFERENCES
[1] http://www.just.edu.jo/ oda/research/comp arith/decimal/bin2bcd/.
[2] J. Bhattacharya, A. Gupta, and A. Singh. A high performance binary
to BCD converter for decimal multiplication. 2010 International
Symposium on VLSI Design, Automation and Test, 2010.
[3] Fadi Busaba, Timothy Slegel, Steven Carlough, Christopher Krygowski,
and John G. Rell. The design of the fixed point unit for the z990
microprocessor. In Proceedings of the 14th ACM Great Lakes Symposium
on VLSI, GLSVLSI '04, pages 364-367, New York, NY, USA, 2004.
ACM.
[4] M.A. Erle and M.J. Schulte. Decimal multiplication via carry-save
addition. In Application-Specific Systems, Architectures, and Processors,
2003. Proceedings. IEEE International Conference on, pages 348-358,
2003.
[5] M.A. Erle, E.M. Schwarz, and M.J. Schulte. Decimal multiplication
with efficient partial product generation. In Computer Arithmetic, 2005.
ARITH-17 2005. 17th IEEE Symposium on, pages 21-28, 2005.