
Fast and Compact Binary-to-BCD Conversion Circuits for Decimal Multiplication


Osama Al-Khaleel, Zakaria Al-Qudah, Mohammad Al-Khaleel, Christos A. Papachristou, and Francis G. Wolff

Jordan University of Science and Technology, Irbid, Jordan
Email: oda@just.edu.jo
Yarmouk University, Irbid, Jordan
Email: {zakaria.al-qudah,khaleel}@yu.edu.jo
Case Western Reserve University, Cleveland, OH
Email: {cap2,fxw12}@case.edu

Abstract: Decimal arithmetic has received considerable attention recently due to its suitability for many financial and
commercial applications. In particular, numerous algorithms
have been recently proposed for decimal multiplication. A major
approach to decimal multiplication shaped by these proposals is
based on performing the decimal digit-by-digit multiplication in
binary, converting the binary partial product back to decimal,
and then adding the decimal partial products as appropriate
to form the final product in decimal. With this approach, the
efficiency of binary-to-BCD partial product conversion is critical
for the efficiency of the overall multiplication process. A recently
proposed algorithm for this conversion is based on splitting the
binary partial product into two parts (i.e., two groups of bits),
and then computing the contributions of the two parts to the
partial BCD result in parallel. This paper proposes two new
algorithms (Three-Four split and Four-Three split) based on this
principle. We present architectures that implement
these algorithms and compare them with existing designs. The
synthesis results show that the Three-Four split algorithm runs
15% faster and occupies 23.1% less area than the best performing
equivalent circuit found in the literature. Furthermore, the Four-Three
split algorithm occupies 37.5% less area than the state-of-the-art
equivalent circuit.

I. INTRODUCTION
Decimal arithmetic is natural for many applications, especially where the input data is already in decimal format,
such as financial and scientific applications. Therefore, several
algorithms have been proposed to support the basic decimal
arithmetic operations (including multiplication) in hardware.
Decimal multiplication in particular has been tackled in
several ways. For example, one way for decimal multiplication
is to perform the multiplication directly in decimal. Another
approach is to convert the operands to binary, perform the
multiplication in binary, and then convert the result back to
decimal. A third approach to decimal multiplication involves
performing decimal digit-by-digit multiplication in binary and
then converting the resulting binary partial product to decimal.
Decimal partial products are then added as appropriate to form
the final decimal product. This last approach is illustrated in
Figure 1, where $P_{ij}$ are the binary partial products that are
to be converted to the decimal partial products $D_{ij}$. In this
technique, the performance of binary-to-decimal conversion is
critical to the overall multiplication performance.

978-1-4577-1954-7/11/$26.00 2011 IEEE

While generic techniques for converting an arbitrary binary
number into its decimal equivalent can be used, binary partial
products have several characteristics that can be exploited
to speed up their conversion to decimal
(in Binary-Coded Decimal, or BCD, representation). First, the
result of a BCD digit-by-digit multiplication is at most
$9 \times 9 = 81$, or $(1001)_{BCD} \times (1001)_{BCD} = (1010001)_2$, which
requires 7 bits. Furthermore, converting these binary
partial products to decimal requires two BCD digits. Second,
out of the $2^7 = 128$ possible 7-bit combinations, only 37 different
combinations can appear as the result of a decimal digit-by-digit
multiplication. For example, while $(0010100)_2$
can be the result of $5 \times 4$, or $(0101)_{BCD} \times (0100)_{BCD}$, there are
no two decimal digits whose product is $(23)_{10}$, or
$(0010111)_2$. Therefore, $(0010111)_2$ cannot appear as a partial
product.
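
The two properties above are easy to check by enumeration. The following is a minimal Python sketch (not part of the original design flow; the variable name `products` is only illustrative) that lists every product of two decimal digits:

```python
# Enumerate every product of two decimal digits (0..9 each).
products = sorted({a * b for a in range(10) for b in range(10)})

print(len(products))                               # distinct partial products: 37
print(max(products), max(products).bit_length())   # largest value and its width: 81 7
print(20 in products, 23 in products)              # 5 x 4 occurs, 23 never does: True False
```

Only these 37 values need to be handled correctly by a partial product converter; the remaining 7-bit patterns can be treated as don't-care inputs.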
A high-performance binary-to-BCD conversion algorithm
has recently been proposed in [2]. The algorithm converts the
7-bit binary partial product to a 2-digit BCD partial product to
support high-performance decimal multiplication. The basic
idea of the algorithm is to split the 7-bit binary partial
product into two groups: the three most significant bits and the
four least significant bits. The contribution of the three most
significant bits to each of the two decimal digits is computed.
This contribution is then added to the contribution of the four
least significant bits to form the final two-digit BCD partial
product.
In this paper we make two main contributions. First, we
propose an alternative, more efficient architecture to compute
the contributions of the two parts and to combine these
contributions to form the final BCD result. Second, we
investigate a different way of splitting the 7-bit binary partial
product. In particular, we split the binary
partial product into the four most significant bits and the
three least significant bits. A distinguishing property of this
split is that the three least significant bits do not
contribute to the most significant BCD digit. Furthermore, the
contribution of the three least significant bits to the least
significant BCD digit is the three bits themselves. Therefore,
no circuit is needed to compute the contributions of the three
least significant bits to the BCD digits of the result. As shown
in Section IV, this results in a smaller conversion circuit.

[Fig. 1. Two-digit BCD Multiplication. The operand digits X1 X0 and Y1 Y0 produce the binary partial products P00, P01, P10, and P11, which BIN2BCD blocks convert to the decimal partial products D00, D01, D10, and D11.]

We note that for this multiplication approach the size of
the conversion circuit is a particularly important performance
metric. This is because the number of binary-to-BCD conversion
circuits needed for an n-digit fully parallel decimal
multiplication is $n^2$: one circuit per partial product. Therefore,
the size of the overall binary-to-BCD conversion part of the
multiplication circuit is $n^2$ times the size of one binary-to-BCD
partial product conversion circuit. In other words, the
total size of the binary-to-BCD conversion circuits grows
quadratically with the number of decimal digits to be multiplied.
On the other hand, the delay of the overall conversion part
of the multiplication circuit is the same as the delay of a
single conversion unit, because all the
partial product conversion units operate in parallel. Therefore,
optimizing the area of the conversion circuit is of
critical importance for this multiplication technique.
The rest of this paper is organized as follows. Section II
places this work in the context of existing work. Section
III describes the proposed algorithms in detail. Section IV
presents the performance evaluation of our algorithms and
discusses the results. We present our future work plans and
conclude in Section V.


II. RELATED WORK

Generic techniques for binary-to-decimal conversion were
developed long ago [9], [10]. When used in the context of
decimal multiplication, these techniques miss the opportunity
for optimizations that exploit the particularities of decimal
multiplication. Therefore, this paper develops high-performance
specialized conversion circuits for decimal multiplication.
Considerable research has already been done in the area
of decimal multiplication. For example, several sequential
decimal multiplication circuits have been proposed in [3]-[5],
[8]. Fully combinational (parallel) decimal multiplication has
been explored in various ways to support applications that
require high-speed multiplication. For example, [12] proposes
an algorithm by which BCD operands are first converted to
binary. The operands are then multiplied in binary and
the final result is converted to BCD. The advantage of
such a scheme is that it utilizes the binary multipliers already
available in configurable hardware.
Parallel architectures that perform digit-by-digit multiplications
include [6], [7], and multi-digit architectures include [11].
The work in [13] involves multiplying the decimal digits and
adding the partial products all in binary and converting the
final result back to BCD. One advantage of such an approach
is that the same multiplier can be used as a binary or decimal
multiplier, with the output taken before or after the binary-to-BCD
conversion stage, respectively. On the other hand, such
an approach requires a complex binary-to-BCD conversion
circuit simply because it converts the entire binary result to
BCD at once. Furthermore, as the number of BCD digits to be
multiplied increases, the conversion circuit becomes more
complex [13].
A new approach to decimal multiplication, to
which our proposed algorithms belong, involves multiplying
the decimal digits in binary, converting the binary partial products
to decimal partial products, and then adding the decimal
partial products as appropriate in BCD to form
the final decimal result. This approach has the advantage of
a simple conversion circuit, since binary-to-BCD conversion
is done on the result of a one-digit by one-digit multiplication
(which is only a 7-bit binary number, as explained before) as
opposed to the entire binary result, whose size depends on the
number of digits to be multiplied.

TABLE I
EXAMPLES OF THE CONTRIBUTION OF THE THREE MOST SIGNIFICANT BITS TO $D_H$ AND $D_L$ AS COMPUTED BY [2]

Most Three Bits    Contribution to DH    Contribution to DL
000                0000                  0000
010                0011                  0010
101                1000                  0000

A recent architecture based on this approach has been
proposed in [2]. In this work, a novel algorithm is proposed
to convert a 7-bit binary partial product to a 2-digit BCD partial
product (denoted as $D_H$ and $D_L$ for the most significant and
least significant BCD digits, respectively). The algorithm splits
the 7-bit binary partial product into two parts: the three most
significant bits and the four least significant bits. Table I lists
some examples of the contribution of the three most
significant bits to $D_H$ and $D_L$. The algorithm then proceeds as
follows. The four least significant bits are first BCD-corrected
if needed (by adding $(0110)_2$). This process results in a BCD
digit and a potential carry. If there is a carry, it is added to
the sum that produces $D_H$. The resulting BCD digit is added
to the contribution of the most significant bits to $D_L$ and the
result is corrected again. The outcome of this operation is the
least significant BCD digit ($D_L$) and a potential carry, which is
also added in the circuit that computes $D_H$. Therefore, $D_H$
is computed as the contribution of the three most significant
bits to $D_H$ plus the two potential carries resulting from the
two BCD correction operations performed to compute $D_L$.

We note here that there is a minor error in computing C2 in the
architecture of [2]. Equation 6 in [2] produces a wrong output when the
output needs to be 27 (3 × 9 or 9 × 3). The correct equation for this signal
is as follows: C2 = p4 p3 p2 + c1 p5 p3 + c1 p4 p3 + p6 p3 + p4 p2 p1. We use
the correct equation for the performance comparison below.

[Fig. 2. Three-Four Split 7-bit Binary to 2-digit BCD Converter. The A6 A5 A4 Contribution Generator produces X7 X6 X5 X4 and X3 X2 X1; the A3 A2 A1 A0 Contribution Generator produces Z4 and Z3 Z2 Z1; the DL Generator outputs DL and the carry C, and the DH Generator outputs DH.]

TABLE II
EXAMPLES OF THE CONTRIBUTION OF THE FOUR LEAST SIGNIFICANT BITS TO $D_H$ AND $D_L$

Least Four Bits    Contribution to DH    Contribution to DL
0000               0000                  0000
0011               0000                  0011
0111               0000                  0111
1000               0000                  1000
1111               0001                  0101

As explained in the next few sections, our work improves
this algorithm by computing not only the contribution of the
three most significant bits to $D_H$ and $D_L$, but also the
contribution of the four least significant bits to $D_H$ and $D_L$. For
both $D_H$ and $D_L$, the contributions of the two bit groups are
added. In addition, optimizations are applied to increase
the performance of our proposed architecture. Furthermore,
another algorithm is proposed in which we split the 7-bit
binary partial product into the four most significant bits and
the three least significant bits. Using the same principle, we
show that this algorithm performs better in area and power because
the three least significant bits make no contribution to $D_H$.
Furthermore, the contribution of the three least significant bits
to $D_L$ is the three bits themselves. Therefore, no circuit is
needed to compute this contribution.
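
The conversion procedure of [2], as it is described earlier in this section, can be sketched behaviourally in Python. This is only an illustration of the described steps, not the gate-level architecture of [2]; the function name `convert_sequential` and the use of `divmod` to obtain the contribution of the three most significant bits are assumptions made for the sketch.

```python
def convert_sequential(p):
    """Behavioural model of the conversion described for [2] (0 <= p <= 81)."""
    high, low = p >> 4, p & 0xF
    # Contribution of the three most significant bits (value 16 * high),
    # modelled here with divmod instead of dedicated logic (assumption).
    dh, high_dl = divmod(16 * high, 10)
    # First BCD correction: add 0110 when the low nibble exceeds 9.
    c1 = 1 if low > 9 else 0
    digit = (low + 6) & 0xF if c1 else low
    # Add the high-bit contribution to DL and correct again.
    s = digit + high_dl
    c2 = 1 if s > 9 else 0
    dl = (s + 6) & 0xF if c2 else s
    # DH is the high-bit contribution plus the two potential carries.
    return dh + c1 + c2, dl


print(convert_sequential(27))  # 3 x 9
print(convert_sequential(63))  # 7 x 9
```

The two corrections are applied one after the other, which is the serial dependency that the contribution-based architectures proposed in this paper aim to shorten.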

III. PROPOSED ARCHITECTURES

A. The Three-Four Split Algorithm

The Three-Four split algorithm is illustrated in Figure 2,
where the 7-bit binary number is split into two groups of bits:
the three most significant bits and the four least significant bits.
We compute the contribution of the three most significant bits
to each of the two BCD digits ($D_L$ and $D_H$), similar to what
has been done in [2] as discussed in Section II. In addition, the
contribution of the four least significant bits to each of the two
BCD digits ($D_L$ and $D_H$) is computed. Some examples of the
contribution of the four least significant bits to $D_L$ and $D_H$ are
listed in Table II. We observe that the contribution of the four
least significant bits to $D_H$ is only a one-bit carry (the other
three bits are always zero) and that $D_L[0] = A_0$. Optimized
logic for the $A_3 A_2 A_1 A_0$ Contribution Generator block of
Figure 2 is derived based on all possible combinations of the
four least significant bits and their contribution to $D_H$ and $D_L$.
Furthermore, the contribution of $A_6 A_5 A_4$ to $D_L$ has the least
significant bit always zero. Therefore, the contribution of these
bits to $D_L$ is only three bits with weights of 2, 4, and 8. The
optimized logic of the $A_6 A_5 A_4$ Contribution Generator block
of Figure 2 is also derived based on all possible combinations
of the three most significant bits and their contribution to $D_H$
and $D_L$. To compute $D_L$, we add the contributions of the two
bit groups, as shown in the DL Generator circuit
in Figure 3. Similarly, $D_H$ is computed by adding the
contributions of the two bit groups to $D_H$ plus any carry (C
in Figures 2 and 3) generated by the DL Generator circuit.
The DH Generator circuit is shown in Figure 4.

[Fig. 3. DL Generator. Two cascaded 3-bit adders: the top adder (Cin = 0) adds X3 X2 X1 and Z3 Z2 Z1, and the lower adder adds 0 C C to the sum to produce DL[3] DL[2] DL[1]; DL[0] = A0.]

[Fig. 4. DH Generator. A 3-bit adder combines X6 X5 X4, 0 0 Z4, and the carry C; X7 passes directly to DH[3].]

To speed up the operation of the DL Generator circuit in
Figure 3, we make the following observations. First, the Cin input
of the top 3-bit adder is always zero. Furthermore, only a
subset of combinations can appear on the inputs of the 3-bit
adder. Therefore, we replace this adder with a customized
version in which we remove Cin and consider only the
possible combinations of $Z_3 Z_2 Z_1$ and $X_3 X_2 X_1$. The resulting
customized circuit is named "optimized addition stage I" in
Figure 5 and is described by the following logic equations:
$S_3 = \overline{X_3} Z_3 + X_2 Z_1 + X_2 Z_2 + X_3 \overline{Z_3}$
$S_2 = X_2 \overline{Z_2}\,\overline{Z_1} + \overline{X_2} Z_2 + \overline{X_2} X_1 Z_1$
$S_1 = \overline{X_1} Z_1 + X_1 \overline{Z_1}$
Second, observations similar to those for the top 3-bit adder
apply to the lower 3-bit adder. In addition to these
observations, the most significant bit of the second operand of
this adder is also zero. Furthermore, the input C is replicated
twice and the Cout output is unused. Therefore, we also
replace this traditional 3-bit adder with a customized version in
which we remove the unnecessary inputs and outputs and
consider only the possible combinations that can appear on its
inputs. The resulting block is named "optimized correction"
in Figure 5, and the resulting output logic equations are as
follows:
$D_L[3] = \overline{S_3} S_2 C + \overline{S_3} S_1 C + S_3 \overline{C} + S_3 \overline{S_2}\,\overline{S_1}$
$D_L[2] = \overline{S_2}\,\overline{S_1} C + S_2 \overline{C} + S_2 S_1$
$D_L[1] = S_1 \overline{C} + \overline{S_1} C$
Third, we compute the carry (C) with lookahead logic to
speed up the operation (the "carry generator" block in Figure 5).
The logic equation for this carry is:
$C = Z_3 X_3 + Z_2 X_3 + Z_2 X_2 + Z_3 X_2 + Z_3 X_1$

[Fig. 5. Optimized DL Generator. Blocks: Optimized Addition Stage I (inputs X3 X2 X1 and Z3 Z2 Z1, outputs S3 S2 S1), Carry Generator (output C), and Optimized Correction (outputs DL[3] DL[2] DL[1]); DL[0] = A0.]

Similar observations also hold for the DH Generator block
shown in Figure 4. As one can see, two of the inputs of the
3-bit adder are always zero and the Cout output is unused.
Replacing this adder with a customized version in which
the unneeded inputs and outputs are removed and only the
combinations that can appear on its inputs are considered
results in the circuit shown in Figure 6. The output logic
equations for this customized adder are as follows:
$D_H[3] = X_7$
$D_H[2] = X_5 Z_4 + X_5 C + X_6$
$D_H[1] = \overline{X_5} X_4 C + X_5 \overline{X_4} + Z_4 C + X_5 \overline{Z_4}\,\overline{C} + \overline{X_5} X_4 Z_4$
$D_H[0] = \overline{X_4} Z_4 \overline{C} + \overline{X_4}\,\overline{Z_4} C + X_4 Z_4 C + X_4 \overline{Z_4}\,\overline{C}$

[Fig. 6. Optimized DH Generator. Optimized Addition Stage II with inputs X7, X6, X5, X4, Z4, and C and outputs DH[3] DH[2] DH[1] DH[0].]

B. The Four-Three Split Algorithm

[Fig. 7. Four-Three Split 7-bit Binary to 2-digit BCD Converter. The A6 A5 A4 A3 Contribution Generator produces Y7 Y6 Y5 Y4 for the DH Generator and Y3 Y2 Y1 for the DL Generator, which also receives A2 A1 A0.]

TABLE III
EXAMPLES OF THE CONTRIBUTION OF THE FOUR MOST SIGNIFICANT BITS TO $D_H$ AND $D_L$

Most Four Bits    Contribution to DH    Contribution to DL
0000              0000                  0000
0011              0010                  0100
0111              0101                  0110
1000              0110                  0100
1010              1000                  0000

In this section we present a second design in which we split
the 7-bit binary number into the four most significant bits and
the three least significant bits. We note that in this case the
least significant bits do not directly contribute to $D_H$. Table III
shows some of the 11 possible combinations of the four most
significant bits and the contribution of these combinations
to $D_H$ and $D_L$. The circuit that computes this contribution
to both $D_H$ and $D_L$ is described by the following logic

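equations:
$Y_7 = A_6 A_4$
$Y_6 = A_5 A_4 \overline{A_3} + A_6 \overline{A_4}\,\overline{A_3} + A_5 A_3 + A_6 A_3$
$Y_5 = \overline{A_5} A_4 A_3 + A_6 \overline{A_4}\,\overline{A_3} + A_5 \overline{A_4}\,\overline{A_3} + A_6 A_3$
$Y_4 = \overline{A_6}\,\overline{A_5} A_4 \overline{A_3} + A_5 \overline{A_4}\,\overline{A_3} + A_5 A_4 A_3 + A_6 A_3$
$Y_3 = \overline{A_6}\,\overline{A_5}\,\overline{A_4} A_3 + A_5 A_4 \overline{A_3}$
$Y_2 = \overline{A_6}\,\overline{A_5} A_4 \overline{A_3} + \overline{A_5} A_4 A_3 + A_6 \overline{A_4}\,\overline{A_3} + A_5 A_4 A_3$
$Y_1 = \overline{A_6}\,\overline{A_5} A_4 \overline{A_3} + A_5 \overline{A_4}\,\overline{A_3} + A_5 A_4 A_3 + A_6 A_3$

TABLE IV
PERFORMANCE COMPARISON OF VARIOUS ARCHITECTURES

Architecture                     Delay (ns)    Area (µm²)    Dynamic Power (µW)
Four-Three split                 1.21          1095          500.49
Three-Four split                 1.13          1348          689.24
Architecture of [2]              1.33          1754          832.1
Corrected Architecture of [6]    1.97          2241          1112.6
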
Note that the contribution of the four most significant bits
to $D_L$ has the least significant bit always zero. Therefore,
these four bits contribute only three bits with weights of
2, 4, and 8. The overall architecture of the Four-Three split
binary-to-BCD converter is shown in Figure 7. The DL Generator
circuit is
similar to the circuit shown in Figure 3, with inputs $Z_3 Z_2 Z_1$
replaced with $0 A_2 A_1$ and $X_3 X_2 X_1$ replaced with the three
bits coming out of the $A_6 A_5 A_4 A_3$ contribution generator
($Y_3 Y_2 Y_1$). Observations similar to those made for Figure 5 can
also be made. For example, the top 3-bit adder now has Cin
always zero and one of its input bits always zero (since we
replaced $Z_3 Z_2 Z_1$ with $0 A_2 A_1$). The optimized DL generator
circuit consists of three blocks similar to those of the optimized DL
generator of Figure 5. These blocks have
the following output logic equations.
Optimized addition stage I:
$S_3 = Y_3 + Y_2 Y_1 A_1 + Y_2 A_2$
$S_2 = Y_2 \overline{A_2}\,\overline{A_1} + \overline{Y_2} A_2 + Y_2 \overline{Y_1}\,\overline{A_2} + \overline{Y_2} Y_1 A_1 + Y_1 A_2 A_1$
$S_1 = Y_1 \overline{A_1} + \overline{Y_1} A_1$
Optimized correction: same as the optimized correction of
the Three-Four split architecture of Figure 5.
Carry generator:
$C = Y_2 Y_1 A_2 + Y_3 A_1 + Y_3 A_2 + Y_2 A_2 A_1$
The DH Generator block is also similar to that shown in the
circuit of Figure 4, with inputs $0 0 Z_4$ replaced with all zeros.
Similarly, a customized circuit can be built along the lines of Figure
6. The output logic equations for this circuit are as follows:
$D_H[0] = Y_4 \overline{C} + \overline{Y_4} C$
$D_H[1] = C Y_4 + Y_5$
$D_H[2] = Y_6$
$D_H[3] = Y_7$
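
The two architectures of this section can be cross-checked with a small bit-level model. The Python sketch below is an illustration only, written under stated assumptions: the helper names (`n`, `bits`, `correct`, `three_four`, `four_three`) are hypothetical; the complemented literals are placed where the sum-of-products terms reproduce the correct BCD digits for all 37 valid partial products; and the two contribution generators of the Three-Four split, for which no equations are listed, are modelled with `divmod` rather than gates.

```python
def n(b):
    """Complement of a single bit (0 or 1)."""
    return 1 - b

def bits(v, width):
    """Bits of v, most significant first."""
    return [(v >> i) & 1 for i in range(width - 1, -1, -1)]

def correct(s3, s2, s1, c):
    """Optimized correction: adds 0CC to S3 S2 S1 (weights 8, 4, 2)."""
    d3 = (n(s3) & s2 & c) | (n(s3) & s1 & c) | (s3 & n(c)) | (s3 & n(s2) & n(s1))
    d2 = (n(s2) & n(s1) & c) | (s2 & n(c)) | (s2 & s1)
    d1 = (s1 & n(c)) | (n(s1) & c)
    return d3, d2, d1

def three_four(p):
    """Three-Four split; contribution generators modelled by divmod (assumption)."""
    xh, xl = divmod(16 * (p >> 4), 10)      # contribution of A6 A5 A4
    z4, zl = divmod(p & 0xF, 10)            # contribution of A3 A2 A1 A0
    x7, x6, x5, x4 = bits(xh, 4)
    x3, x2, x1 = bits(xl >> 1, 3)
    z3, z2, z1 = bits(zl >> 1, 3)
    s3 = (n(x3) & z3) | (x2 & z1) | (x2 & z2) | (x3 & n(z3))
    s2 = (x2 & n(z2) & n(z1)) | (n(x2) & z2) | (n(x2) & x1 & z1)
    s1 = (n(x1) & z1) | (x1 & n(z1))
    c = (z3 & x3) | (z2 & x3) | (z2 & x2) | (z3 & x2) | (z3 & x1)
    d3, d2, d1 = correct(s3, s2, s1, c)
    h2 = (x5 & z4) | (x5 & c) | x6
    h1 = ((n(x5) & x4 & c) | (x5 & n(x4)) | (z4 & c)
          | (x5 & n(z4) & n(c)) | (n(x5) & x4 & z4))
    h0 = x4 ^ z4 ^ c
    return 8 * x7 + 4 * h2 + 2 * h1 + h0, 8 * d3 + 4 * d2 + 2 * d1 + (p & 1)

def four_three(p):
    """Four-Three split, built from the logic equations of Section III-B."""
    a6, a5, a4, a3, a2, a1, a0 = bits(p, 7)
    y7 = a6 & a4
    y6 = (a5 & a4 & n(a3)) | (a6 & n(a4) & n(a3)) | (a5 & a3) | (a6 & a3)
    y5 = (n(a5) & a4 & a3) | (a6 & n(a4) & n(a3)) | (a5 & n(a4) & n(a3)) | (a6 & a3)
    y4 = (n(a6) & n(a5) & a4 & n(a3)) | (a5 & n(a4) & n(a3)) | (a5 & a4 & a3) | (a6 & a3)
    y3 = (n(a6) & n(a5) & n(a4) & a3) | (a5 & a4 & n(a3))
    y2 = (n(a6) & n(a5) & a4 & n(a3)) | (n(a5) & a4 & a3) | (a6 & n(a4) & n(a3)) | (a5 & a4 & a3)
    y1 = y4                                  # the Y1 and Y4 expressions coincide
    s3 = y3 | (y2 & y1 & a1) | (y2 & a2)
    s2 = ((y2 & n(a2) & n(a1)) | (n(y2) & a2) | (y2 & n(y1) & n(a2))
          | (n(y2) & y1 & a1) | (y1 & a2 & a1))
    s1 = (y1 & n(a1)) | (n(y1) & a1)
    c = (y2 & y1 & a2) | (y3 & a1) | (y3 & a2) | (y2 & a2 & a1)
    d3, d2, d1 = correct(s3, s2, s1, c)
    h0 = (y4 & n(c)) | (n(y4) & c)
    h1 = (c & y4) | y5
    return 8 * y7 + 4 * y6 + 2 * h1 + h0, 8 * d3 + 4 * d2 + 2 * d1 + a0

# Exhaustive check over every valid partial product.
valid = sorted({a * b for a in range(10) for b in range(10)})
print(all(three_four(p) == four_three(p) == divmod(p, 10) for p in valid))
```

Because the equations exploit don't-care inputs, the model is only meaningful for the 37 values that a digit-by-digit multiplication can produce; other 7-bit inputs may give arbitrary outputs.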
IV. RESULTS
A. Performance Evaluation
In this section, we compare four different architectures for
binary partial product to decimal partial product conversion.
These architectures are: (i) our Three-Four split, (ii) our
Four-Three split, (iii) the architecture proposed in [2] with
C2 corrected as noted before, and (iv) the version of the
architecture of [6] that is corrected in [2]. We describe all
architectures using Verilog HDL data flow modeling (the Verilog
code is posted at [1] to facilitate the reproduction of the results). The
verification is based on testing all possible input combinations
using ModelSim. All designs were synthesized using
Synopsys Design Compiler with the Oklahoma State University
0.18 µm standard cell library. The synthesis results are shown
in Table IV. To allow the results to be reproduced, we list below
the commands that we used to obtain them, assuming that
the Verilog code is stored in a directory called src and the
top-level module is named top:
    analyze -format verilog -lib WORK -autoread -recursive ./src
    elaborate top -arch "verilog" -lib WORK -update
    link
    ungroup -all -flatten -simple_names
    compile -map_effort high
We note that we could not reproduce the results of [2] due to
the lack of sufficient information on the synthesis environment.
(Furthermore, the power results presented in [2] are given in nW; we
believe this is a typo and that the unit should be µW.)
B. Discussion
The results presented in Table IV show that, in terms of
speed, our Three-Four split algorithm achieves 15% and 42%
improvement over the architecture presented in [2] and the
corrected architecture of [6], respectively. Moreover, compared
with the algorithm presented in [2], the Three-Four split
algorithm achieves 23% and 17% savings in area and power,
respectively. The savings in area and power when compared
with the corrected architecture of [6] are 39.8% and 38%,
respectively.
For our Four-Three split algorithm, the improvement in
speed over the architecture presented in [2] and the corrected
architecture of [6] is 9% and 38.5%, respectively. This algorithm
achieves 37.5% and 39.8% savings over the algorithm
presented in [2] in terms of area and power, respectively. On
the other hand, compared with the corrected architecture of
[6], the Four-Three split algorithm has less area and power by
51% and 55%, respectively.
It should be pointed out that although our Four-Three
split algorithm is slightly slower than the Three-Four split
algorithm, it demonstrates better results in terms of area and
power.
As noted earlier in Section I, the area of the binary-to-BCD
conversion circuit is of particular importance. The
reason is that we need $n^2$ such conversion circuits to perform an
$n \times n$ digit fully parallel decimal multiplication. Therefore,
any improvement in the area of the digit-by-digit conversion
circuit grows into a sizable improvement as the number of
digits increases. For example, for 32-digit by 32-digit decimal
multiplication, the portion of the multiplication circuit that is
responsible for binary-to-BCD conversion is smaller by
259,072 µm² when using the Four-Three split algorithm than
when using the Three-Four split algorithm. Further, the
conversion circuit portion is smaller by 674,816
µm² when using the Four-Three split algorithm than when
using the algorithm proposed in [2].
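
The percentages and area figures quoted in this discussion follow directly from Table IV. The Python sketch below recomputes them; the dictionary keys and the helper names `saving` and `area_gap` are illustrative only, and the values are copied from Table IV.

```python
# (delay in ns, area, dynamic power) per architecture, copied from Table IV.
TABLE_IV = {
    "four_three": (1.21, 1095, 500.49),
    "three_four": (1.13, 1348, 689.24),
    "ref2": (1.33, 1754, 832.1),
    "ref6_corrected": (1.97, 2241, 1112.6),
}

def saving(new, ref):
    """Relative improvement of `new` over `ref`, in percent."""
    return 100.0 * (ref - new) / ref

def area_gap(smaller, larger, n):
    """Total area difference for an n x n digit multiplier (n * n converters)."""
    return (TABLE_IV[larger][1] - TABLE_IV[smaller][1]) * n * n

for name in ("three_four", "four_three"):
    ours, ref = TABLE_IV[name], TABLE_IV["ref2"]
    # Delay, area, and power improvements over the architecture of [2].
    print(name, [round(saving(o, r), 1) for o, r in zip(ours, ref)])

print(area_gap("four_three", "three_four", 32))  # 259072
print(area_gap("four_three", "ref2", 32))        # 674816
```

The area gap grows with the square of the operand length because a fully parallel multiplier needs one converter per partial product.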
V. CONCLUSIONS AND FUTURE WORK

This paper presents a set of algorithms and architectures for
converting a 7-bit binary partial product into a 2-digit BCD
partial product to be used in decimal multiplication circuits.
We present two algorithms based on splitting the 7-bit binary
input into two parts. The first algorithm (the Three-Four split
algorithm) splits the input into the three most significant bits
and the four least significant bits. The contributions of the two
bit groups are computed and added together appropriately to
form the 2-digit BCD partial product. The second algorithm
(the Four-Three split algorithm) splits the 7-bit binary partial
product into the four most significant bits and the three least
significant bits.
We demonstrate in this paper an improvement in speed of
15% for the Three-Four split algorithm and 9% for the Four-Three
split algorithm over the fastest algorithm presented in
the literature (i.e., [2]). More importantly, we demonstrate an
improvement in the area of the conversion circuit of 23%
for the Three-Four split algorithm and 37.5% for the Four-Three
split algorithm compared with the state-of-the-art
conversion architecture described in [2]. In terms of power,
compared with the algorithm presented in [2], the Three-Four
split algorithm achieves a 17% power saving and the
Four-Three split algorithm achieves a 39.8% power saving.
Our future plans include studying the performance of various
decimal multiplication schemes in the literature, especially
in light of the efficient conversion circuits presented
in this paper.

REFERENCES
[1] http://www.just.edu.jo/ oda/research/comp arith/decimal/bin2bcd/.
[2] J. Bhattacharya, A. Gupta, and A. Singh. A high performance binary
to BCD converter for decimal multiplication. 2010 International
Symposium on VLSI Design, Automation and Test, 2010.
[3] Fadi Busaba, Timothy Slegel, Steven Carlough, Christopher Krygowski,
and John G. Rell. The design of the fixed point unit for the z990
microprocessor. In Proceedings of the 14th ACM Great Lakes Symposium
on VLSI, GLSVLSI '04, pages 364-367, New York, NY, USA, 2004.
ACM.
[4] M.A. Erle and M.J. Schulte. Decimal multiplication via carry-save
addition. In Application-Specific Systems, Architectures, and Processors,
2003. Proceedings. IEEE International Conference on, pages 348-358,
2003.
[5] M.A. Erle, E.M. Schwarz, and M.J. Schulte. Decimal multiplication
with efficient partial product generation. In Computer Arithmetic, 2005.
ARITH-17 2005. 17th IEEE Symposium on, pages 21-28, 2005.
[6] G. Jaberipur and A. Kaivani. Binary-coded decimal digit multipliers.
Computers & Digital Techniques, IET, 1(4):377-381, 2007.
[7] R.K. James, T.K. Shahana, K.P. Jacob, and S. Sasi. Decimal
multiplication using compact BCD multiplier. In Electronic Design, 2008. ICED
2008. International Conference on, pages 1-6, 2008.
[8] R.D. Kenney, M.J. Schulte, and M.A. Erle. A high-frequency decimal
multiplier. In Computer Design: VLSI in Computers and Processors,
2004. ICCD 2004. Proceedings. IEEE International Conference on,
pages 26-29, 2004.
[9] V. T. Rhyne. Serial binary-to-decimal and decimal-to-binary conversion.
IEEE Trans. Comput., 19:808-812, September 1970.
[10] M. S. Schmookler. High-speed binary-to-decimal conversion. IEEE
Trans. Comput., 17:506-508, May 1968.
[11] A. Vazquez, E. Antelo, and P. Montuschi. A new family of
high-performance parallel decimal multipliers. In Computer Arithmetic,
2007. ARITH '07. 18th IEEE Symposium on, pages 195-204, 2007.
[12] M.P. Véstias and H.C. Neto. Parallel decimal multipliers using
binary multipliers. In Programmable Logic Conference (SPL), 2010
VI Southern, pages 73-78, 2010.
[13] S. Veeramachaneni and M.B. Srinivas. Novel high-speed architecture for
32-bit binary coded decimal (BCD) multiplier. In Communications and
Information Technologies, 2008. ISCIT 2008. International Symposium
on, pages 543-546, 2008.
