
This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

Design-Efficient Approximate Multiplication Circuits Through Partial Product Perforation
Georgios Zervakis, Kostas Tsoumanis, Student Member, IEEE, Sotirios Xydis,
Dimitrios Soudris, and Kiamal Pekmestzi

Abstract: Approximate computing has received significant attention as a promising strategy to decrease power consumption of inherently error tolerant applications. In this paper, we focus on hardware-level approximation by introducing the partial product perforation technique for designing approximate multiplication circuits. We prove in a mathematically rigorous manner that in partial product perforation, the imposed errors are bounded and predictable, depending only on the input distribution. Through extensive experimental evaluation, we apply the partial product perforation method on different multiplier architectures and expose the optimal architecture-perforation configuration pairs for different error constraints. We show that, compared with the respective exact design, the partial product perforation delivers reductions of up to 50% in power consumption, 45% in area, and 35% in critical delay. In addition, the product perforation method is compared with the state-of-the-art approximation techniques, i.e., truncation, voltage overscaling, and logic approximation, showing that it outperforms them in terms of power dissipation and error.

Index Terms: Approximate arithmetic circuits, approximate computing, approximate multiplier, error analysis, low power.

Manuscript received September 17, 2015; revised January 4, 2016; accepted February 9, 2016. This work has been partially supported by the E.C. program AEGLE under H2020 Grant Agreement No: 644906.
The authors are with the Department of Electrical and Computer Engineering, National Technical University of Athens, Athens 15780, Greece (e-mail: zervakis@microlab.ntua.gr; kostastsoumanis@gmail.com; sxydis@microlab.ntua.gr; dsoudris@microlab.ntua.gr; pekmes@microlab.ntua.gr).
Digital Object Identifier 10.1109/TVLSI.2016.2535398
1063-8210 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

IN MODERN embedded electronic devices, power consumption is a first-class design concern. Considering that a large number of application domains are inherently tolerant to imprecise calculations, e.g., digital signal processing (DSP), data analytics, and data-mining [1], approximate computing appears as a promising solution to reduce their power dissipation. Such applications process large redundant data sets or noisy input data derived from the real world, do not have a golden result, perform statistical/probabilistic computations, and/or demand human interaction; thus their exactness is relaxed due to limited human perception [2], [3]. Approximate computing can be applied at both software and hardware levels.

Hardware-level approximation mainly targets arithmetic units, such as adders and multipliers, widely used in portable devices to implement multimedia algorithms, e.g., image and video processing. The most commonly used techniques for the generation of approximate arithmetic circuits are truncation [4], [5], voltage overscaling (VOS) [2], [6], and simplification of logic complexity (i.e., alteration of the truth table) [7]-[9]. Extensive research has been conducted on approximate adders [6], [7], [10], [11], providing significant gains in terms of area and power while exposing small error. However, research activities on approximate multipliers are limited. Efficient approximate multipliers introduced in [8], [9], [12], and [13] target the approximation of the partial product accumulation but do not examine approximations on the partial product generation.

Approximate hardware circuits, contrary to software approximations, offer transistor reduction, lower dynamic and leakage power, lower circuit delay, and opportunity for downsizing. Motivated by the limited research on approximate multipliers, compared with the extensive research on approximate adders, and explicitly the lack of approximate techniques targeting the partial product generation, we introduce the partial product perforation method for creating approximate multipliers. Inspired by [14], we omit the generation of some partial products; by thus reducing the number of partial products that have to be accumulated, we decrease the area, power, and depth of the accumulation tree. The major contributions of this paper are summarized as follows.

1) We adopt and apply, for the first time, the software-based perforation technique [14] on the design of hardware circuits, obtaining the optimized design solutions regarding the power-area-error tradeoffs.
2) We analyze in a mathematically rigorous manner the arithmetic accuracy of partial product perforation and prove that it delivers a bounded and predictable output error. Our error analysis is not bound to a specific multiplier architecture and can be applied with error guarantees to every multiplication circuit regardless of its architecture. Such a rigorous analysis enables precise error estimation over input data distributions.
3) We explore and characterize the efficiency of the product perforation method on several multiplier schemes, exposing its power-area impact on different architectures. This is the first time that such an exploratory analysis over different approximate multiplier architectures is offered to the designer, enabling also the selection of the optimum architecture-perforation configuration for given error constraints.
4) We show that the partial product perforation outperforms the related state-of-the-art works in terms of power consumption and error, as well as output quality,
when applied to image processing and data analytics algorithms.

More specifically, we apply the partial product perforation on 16 different multiplier architectures using industrial strength tools, i.e., Synopsys Design Compiler and PrimeTime. Through extensive experimental evaluation, we present the optimal approximate multiplier configurations for various error constraints. We show that, compared with the accurate multiplier, the product perforation offers reductions of up to 50% in power consumption, 45% in area, and 35% in critical delay for 0.1% normalized mean error distance (NMED) [15]. Moreover, it is compared with the state-of-the-art approximate computing works that use either VOS [6], logic approximation [9], or truncation [4], outperforming them significantly in terms of power dissipation and error. Finally, we examine the scalability of our technique by applying it on different bit-width multipliers and show that the delivered savings increase with the width increase.

The rest of this paper is organized as follows. In Section II, we discuss the related literature with an emphasis on circuit-level approximation. Section III introduces the partial product perforation technique, providing the corresponding error analysis and error correction methods. In Section IV, we examine the product perforation on different multiplier architectures, exposing the optimal architecture-perforation configuration pairs under differing error constraints. Section V evaluates the product perforation method by comparing it with the related state-of-the-art works. Finally, the conclusion is drawn in Section VI.

II. RELATED WORK

In this section, the related research in the field of hardware approximate computing is discussed. Both general-purpose approximation techniques [4], [6], [16] applied to any arithmetic circuit and circuit-specific approximation either to adder [7], [10], [11] or multiplier designs [8], [9], [13], [17], [18] have been presented.

Regarding the general approximation techniques, VOS [2], [6] and truncation [4], [5], [12] have been proposed. VOS is applied in any circuit by lowering the supply voltage below its nominal value. Decreasing the supply voltage reduces the circuit's power consumption, but produces errors caused by the number of paths that fail to meet the delay constraints [2]. Banescu et al. [12] proposed an automated generation of large precision floating-point multipliers in field-programmable gate arrays using sophisticated truncation over underutilized DSPs. In [5], a truncated multiplier with a constant correction term is proposed, significantly decreasing the error imposed by typical truncation. King and Swartzlander [4] proposed a truncated multiplier with variable correction that outperforms [5] in terms of error. Probabilistic pruning and logic minimization techniques have been presented in [16] using a greedy approach to generate approximate circuits. These techniques systematically eliminate the circuit's components and simplify logic complexity according to the circuit's activity profile and output significance. Both the techniques heavily depend on the application's characteristics, and in addition, the induced approximation error is not rigorously bounded.

Extensive research has been conducted targeting the implementation of approximate adders [7], [10], [11]. Verma et al. [11] developed a probability proof, estimating that the longest carry chain in an n-bit adder is log n, and produced a fast inexact adder limiting the carry propagation. In [10], approximation is performed by decomposing the addition circuit in an accurate and an approximate inaccurate part. Gupta et al. [7] build imprecise full adder cells, requiring fewer transistors, by approximating their logic function and then use them to build imprecise adders. Although it is proposed to use such adders to build approximate multipliers, it is not clear how they can be used in different tree architectures and how their error scales in the case of multioperand addition. Targeting the creation of approximate multipliers, Kulkarni et al. [8] proposed a simplified imprecise 2 × 2 multiplier cell used as the basic block for constructing larger multiplier architectures. Momeni et al. [9] presented two approximate 4:2 compressors by modifying the respective accurate truth table, which were then used to build two approximate multipliers outperforming [8]. The approximate compressors of [9] are used in Dadda tree with 4:2 reduction. However, different multiplier architectures were not explored. Based on an approximate adder that limits the carry propagation, Liu et al. [13] presented a fast and low-power multiplier scheme with higher error than [9]. However, in all the aforementioned approaches, the imposed error cannot be predicted, as it depends on carry propagation and the circuit's implementation, and requires simulations over all possible inputs in order to be calculated.

Recently, Narayanamoorthy et al. [17] and Hashemi et al. [18] proposed the use of m × m multipliers to perform an n × n multiplication (with m < n). Narayanamoorthy et al. [17] statically split the multiplicand in three m-bit segments and perform the multiplication utilizing the segment containing the most significant 1 (leading one). However, as stated in [18], m needs to be at least n/2 to attain acceptable accuracy, thus limiting the energy savings and the scalability of this approach. Hashemi et al. [18] extended the idea of leading-one segments to enable dynamic range multiplication and added a correction term. Although [18] delivers higher accuracy designs than [17] using smaller values for m, its approach requires the allocation of extra complex circuitry, i.e., two leading-one detectors, two complex multiplexers for segment selection, one log(n)-bit comparator, a log(n)-bit adder, and one 2n-bit barrel shifter. These extra components are expected to highly increase the circuit's complexity, introducing nontrivial delay, area, and energy overheads that may considerably decrease the approximation benefits [17]. This is expected to be more evident in designs targeting very small error values, in which larger m values are required.

In this paper, we target the design of power-error efficient multiplication circuits. We differ from the previous works by exploring approximation on the generation of the partial products. The proposed method can be easily applied in any multiplier architecture without the need for a special design,
in contrast to related works. In addition, the error imposed by perforation depends only on the configuration parameters and, in contrast to existing work, can be analytically calculated without the need for exhaustive simulations. The latter is critical, as, given the application's inputs, a precise estimation of the output quality can be extracted. Finally, the knowledge of the induced error permits the selection of the configuration that maximizes the power savings for a specific error bound.

III. ANALYZING PARTIAL PRODUCT PERFORATION

A. Method Analysis

In this section, the partial product perforation method for the design of approximate hardware multipliers is described. Consider two n-bit numbers A and B. The result of their multiplication A × B is obtained after summing all the partial products $A b_i$, where $b_i$ is the ith bit of B. Thus

$$A \times B = \sum_{i=0}^{n-1} A b_i 2^i, \quad b_i \in \{0, 1\}. \tag{1}$$

The partial product perforation technique omits the generation of k successive partial products starting from the jth one. A perforated partial product is not inserted in the accumulation tree, and hence n full adders can be eliminated. Applying the product perforation with j and k configuration values on the multiplication, A × B produces the approximate result

$$A \times B\,|_{j,k} = \sum_{\substack{i=0 \\ i \notin [j,\, j+k)}}^{n-1} A b_i 2^i, \quad b_i \in \{0, 1\}. \tag{2}$$

Note that $j \in [0, n-1]$ and $k \in [1, \min(n-j,\, n-1)]$.

Similarly, when modified booth encoding (MBE) [19] is used for generating the partial products, the result of the approximate multiplication is given by

$$A \times B\,|_{j,k} = \sum_{\substack{i=0 \\ i \notin [j,\, j+k)}}^{n/2-1} A\, b_i^{MB} 4^i, \quad b_i^{MB} \in \{0, \pm 1, \pm 2\}. \tag{3}$$

Fig. 1 shows an example of applying the partial product perforation method on different 8-bit multipliers with j = 2 and k = 2 configuration values. For each architecture, the dot diagrams [19] of the accurate and the respective perforated tree are presented. The dots represent the bits of the partial products that have to be accumulated, while the stages represent the delay of the reduction process followed by each tree. The dashed boxes with four dots are 4:2 compressors, those with three are full adders and those with two are either full- or half-adders. Through the proposed approximation technique, the power, area, and delay of the multiplication circuit are decreased, making, though, the computation imprecise. The higher the order of a perforated partial product, the greater the error imposed at the final result. In addition, since the addition is an associative and commutative operation, when more than one partial products are perforated, the total error results from the addition of the errors produced from the perforation of each partial product separately.

Fig. 1. Partial product reduction process for 8 × 8 multiplication with (a) accurate array, (b) approximate array, (c) accurate Wallace, (d) approximate Wallace, (e) accurate compressor 4:2, (f) approximate compressor 4:2, (g) accurate Dadda 4:2, and (h) approximate Dadda 4:2. Approximation is performed by perforating the third and fourth partial products. The boxes with four dots are 4:2 compressors, those with three are full adders and those with two are full- or half-adders.

We use the notation D[j,k,c] to label the different approximate multiplier architectural configurations. The parameter D refers to the tree architecture, j is the order of the first perforated partial product, and k is the number of the perforated partial products. If no j and k are specified, the respective notation refers to the exact design. Finally, c corresponds to the partial product generation technique and takes the value s for simple partial products (SPPs) or m for MBE. For example, Fig. 1(a) shows the array[s] configuration, while Fig. 1(b) shows the array[2,2,s] configuration.

The partial product perforation should not be confused with the truncation technique. Truncation eliminates the circuit
that produces specific least significant bits (LSBs) of the accumulation tree, while the perforation skips the generation of partial products and thus decreases the number of operands to be accumulated. For example, in an 8-bit array multiplier, perforating a partial product removes eight full adders from the accumulation tree and reduces its delay. In order to attain similar circuit reduction using truncation, 6 LSB have to be truncated. However, truncating 6 LSB does not offer any delay reduction. Moreover, in this example, the truncation delivers, in all the cases, incorrect results, whereas the outputs of perforation are 50% correct. Finally, perforating one partial product (out of eight) results in a 12.5% loss of information while truncating 6 LSB (out of 16) results in a 37.5% information loss. In Section V, the perforation and truncation techniques are quantitatively compared in greater detail regarding error and power metrics, in order to further expose their differences.

B. Error Analysis

A critical issue for the approximate computing is the error imposed during computations and how it affects the final result. In this section, an error evaluation analysis of the partial product perforation technique is presented. We evaluate the induced error metrics proposed in [15], i.e., ED, MED, and NMED, as effective metrics for quantifying the accuracy of approximate arithmetic circuits. ED is defined as the absolute distance of the fully accurate product P and the approximate one P', ED = |P − P'|. The MED is the average of EDs for all inputs and NMED = MED/$P_{max}$, where $P_{max} = (2^n - 1)^2$ in the case of an n-bit multiplier [13]. The relative error distance (RED) is defined as RED = ED/P, and the mean RED (MRED) is similarly obtained [13].

1) Error Evaluation: When applying the product perforation on an n-bit multiplier using SPP generation, the ED of multiplying two numbers A and B is calculated as follows:

$$\mathrm{ED}(A,B) = |P - P'| = A \sum_{i=0}^{n-1} b_i 2^i - A \sum_{\substack{i=0 \\ i \notin [j,\, j+k)}}^{n-1} b_i 2^i = A \sum_{i=j}^{j+k-1} 2^i b_i = A\, 2^j x_B \tag{4}$$

where $x_B \in [0, 2^k)$ and

$$x_B = \sum_{i=0}^{k-1} 2^i b_{j+i} = \lfloor B / 2^j \rfloor \bmod 2^k. \tag{5}$$

If $p_A$ and $p_B$ are the probability density functions (PDFs) of A and B, respectively, then the MED is calculated from

$$\mathrm{MED} = \sum_{A,B} p_A(A)\, p_B(B)\, \mathrm{ED}(A,B). \tag{6}$$

Without loss of generality, the rest of our analysis considers a uniform distribution over the overall n-bit numbers, i.e., $(A, B) \in [0, 2^n)^2$. Hence, $p_A(A) = 1/2^n\ \forall A$ and $p_B(B) = 1/2^n\ \forall B$. Therefore, MED is given from

$$\mathrm{MED} = \sum_{A,B} \frac{\mathrm{ED}(A,B)}{2^n 2^n} = \frac{1}{2^{2n}} \sum_{A} \sum_{B} \mathrm{ED}(A,B). \tag{7}$$

Assuming that $\mathrm{ED}_A$ is the sum of EDs $\forall B$ for a given A, we have

$$\mathrm{ED}_A = \sum_{B} \mathrm{ED}(A,B) = 2^{n-k} \sum_{x_B} x_B 2^j A = \frac{2^n 2^j (2^k - 1) A}{2} \tag{8}$$

and the sum of all EDs is

$$\sum_{A} \mathrm{ED}_A = \sum_{A} \frac{2^n 2^j (2^k - 1) A}{2} = \frac{2^n 2^j (2^k - 1)}{2} \sum_{A=0}^{2^n - 1} A = \frac{2^j 2^{2n} (2^k - 1)(2^n - 1)}{4}. \tag{9}$$

Using (9), (7) equals

$$\mathrm{MED} = \frac{2^j 2^{2n} (2^k - 1)(2^n - 1)}{2^{2n} \cdot 4} = \frac{2^j (2^k - 1)(2^n - 1)}{4}. \tag{10}$$

Thus

$$\mathrm{NMED} = \frac{\mathrm{MED}}{(2^n - 1)^2} = \frac{2^j (2^k - 1)}{4 (2^n - 1)}. \tag{11}$$

Similarly

$$\mathrm{RED}(A,B) = \frac{\mathrm{ED}(A,B)}{AB} = \frac{x_B 2^j}{B} \tag{12}$$

and

$$\mathrm{MRED} = \frac{2^n}{2^{2n}} \sum_{B} \frac{x_B 2^j}{B} = \sum_{B} \frac{x_B 2^j}{2^n B}. \tag{13}$$

The previous analysis provides rigorous expressions of error metrics, enabling a fast error analysis of differing product perforation configurations. As shown in Section IV, these analytical error expressions are used in an exploration loop for deriving optimized approximate design solutions. The analytical equations (11) and (13) consider uniform distribution; thus in the case of differing distributions, they should be adjusted according to the new PDFs (footnote 1: in the case of different input distributions, starting from (6), we apply the same steps given the respective PDFs of the input operands), since the power-error efficiency of approximate designs highly depends on the multiplier's operands distribution. In most applications, e.g., multimedia, the inputs are highly correlated [16]. As an intuitive example, Fig. 2(a) shows the power-NMED Pareto graph for a 16-bit Dadda 4:2 multiplier when A and B follow the uniform distribution over the overall range of n-bit numbers, while Fig. 2(b) shows the same graph with inputs derived from the GSM 06.10 audio benchmark [20]. As shown, increasing the k-values results in lower power consumption but increased error values, while the selection of the j-value mostly depends on the input distribution. Intuitively, for a uniform distribution over all possible n-bit numbers [Fig. 2(a)], where all the bits have equal probability of being one or zero, j should be kept small to minimize the error. This is also confirmed from Fig. 2(a), where 58% of the Pareto configurations feature j = 0 and 42% of the Pareto configurations feature j = 1. However, as shown in Fig. 2(b), when the inputs are correlated
without following a uniform distribution, we observe that the Pareto front is formed by configurations featuring many different j values, i.e., 0, 2, 6, and 15. The previous example shows that there is not a golden value for j and k, but their selection highly depends on the error constraints and the input's PDF.

Fig. 2. Pareto power-NMED graph of a 16-bit Dadda 4:2 multiplier with (a) uniform input distribution in $[0, 2^{16})$ and (b) inputs obtained from audio benchmarks. All the configurations that feature NMED < $5 \times 10^{-5}$ are presented. Next to each point is denoted the respective (j, k) configuration.

2) Error Correction Methods: In this section, we introduce two methods to decrease the error induced from the application of partial product perforation. They are implemented as extra components complementing the multiplication circuit; thus their area, power, and delay overheads, as well as the error reduction they offer, do not depend on the architecture of the multiplier. Although multiplication is commutative, i.e., A × B = B × A, this does not apply in perforated multipliers. From (4), when multiplying A × B, the imposed error is proportional to the multiplicand A and the term $x_B$, and thus decreasing one of these operands decreases the error delivered to the output. As a result, comparing A and B, or $x_A$ and $x_B$, before the multiplication and swapping A and B accordingly can reduce the error.

Method 1 (Comparing $x_A$ and $x_B$): In this method, $x_A$ and $x_B$ are compared before the multiplication, and if $x_B > x_A$, A and B are swapped. Therefore, the imposed error is $\mathrm{ED}(A,B) = A\, 2^j x_B$, when $x_A \ge x_B$, and $\mathrm{ED}(A,B) = B\, 2^j x_A$, when $x_B > x_A$. Hence, MED equals

$$\mathrm{MED} = \sum_{A,B} p_A(A)\, p_B(B)\, \mathrm{ED}(A,B) = 2^j \Bigg( \sum_{\substack{A,B: \\ x_A \ge x_B}} p_A(A)\, p_B(B)\, x_B A + \sum_{\substack{A,B: \\ x_A < x_B}} p_A(A)\, p_B(B)\, x_A B \Bigg). \tag{14}$$

If A and B follow the uniform distribution in $[0, 2^n)$, (14) equals:

$$\mathrm{MED} = 2^j \Bigg( \sum_{\substack{A,B: \\ x_A \ge x_B}} \frac{x_B A}{2^n 2^n} + \sum_{\substack{A,B: \\ x_A < x_B}} \frac{x_A B}{2^n 2^n} \Bigg) = \frac{2^j}{2^{2n}} \Bigg( \sum_{\substack{A,B: \\ x_A = x_B}} x_B A + 2 \sum_{\substack{A,B: \\ x_A < x_B}} x_A B \Bigg). \tag{15}$$

Every number A can be written in the form: $A = M_A 2^{j+k} + x_A 2^j + L_A$, where $M_A \in [0, 2^{n-(j+k)})$, $x_A \in [0, 2^k)$, and $L_A \in [0, 2^j)$. $M_A$ and $L_A$ are computed similar to $x_A$.

The sum [S1(y)] of all numbers A that have $x_A = y$, where y is a constant and $y \in [0, 2^k)$, is given by

$$\begin{aligned} S1(y) &= \sum_{A:\, x_A = y} A = \sum_{A:\, x_A = y} \left( M_A 2^{j+k} + x_A 2^j + L_A \right) = \sum_{M_A} \sum_{x_A = y} \sum_{L_A} \left( M_A 2^{j+k} + x_A 2^j + L_A \right) \\ &= 2^j\, 2^{j+k}\, \frac{(2^{n-(j+k)} - 1)\, 2^{n-(j+k)}}{2} + 2^{n-(j+k)}\, 2^j\, y\, 2^j + 2^{n-(j+k)}\, \frac{(2^j - 1)\, 2^j}{2}. \end{aligned} \tag{16}$$

Supposing that B is fixed and $x_B = z$, we get that

$$2 \sum_{A:\, x_A < z} x_A B = 2^{n-k}\, 2B \sum_{x_A < z} x_A = 2^{n-k} z (z - 1) B \tag{17}$$

and

$$\sum_{A:\, x_A = z} z A = z \sum_{A:\, x_A = z} A = z\, S1(z). \tag{18}$$

By evaluating (17) for all B, we obtain

$$2 \sum_{\substack{A,B: \\ x_A < x_B}} x_A B = \sum_{B} 2^{n-k} z (z - 1) B = 2^{n-k} \sum_{z=0}^{2^k - 1} z (z - 1)\, S1(z). \tag{19}$$

By evaluating (18) for all B, we obtain

$$\sum_{\substack{A,B: \\ x_A = x_B}} x_B A = \sum_{B} x_B\, S1(x_B) = 2^{n-k} \sum_{z=0}^{2^k - 1} z\, S1(z). \tag{20}$$

Using (19) and (20), (15) is equal to

$$\mathrm{MED} = \frac{2^j 2^{n-k}}{2^{2n}} \sum_{z=0}^{2^k - 1} z^2\, S1(z) \tag{21}$$

$$\text{and} \quad \mathrm{NMED} = \frac{2^j 2^{n-k}}{2^{2n} (2^n - 1)^2} \sum_{z=0}^{2^k - 1} z^2\, S1(z). \tag{22}$$
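The closed forms derived so far can be cross-checked by exhaustive enumeration on a small operand width. The following is a minimal Python sketch under stated assumptions: the function names are hypothetical (they do not come from the paper), the operands are unsigned and uniformly distributed, SPP generation is assumed, and the summation index z in (21) is taken over the full range of the k-bit field, i.e., $[0, 2^k)$, as implied by the definition of $x_B$ in (4). It is meant as an illustration of equations (2), (5), (10), (16), and (21), not as the authors' evaluation flow.

```python
from fractions import Fraction


def x_field(v, j, k):
    """Return x_v = floor(v / 2^j) mod 2^k, the k-bit field of v at bit j (eq. 5)."""
    return (v >> j) & ((1 << k) - 1)


def perforated_product(a, b, j, k):
    """Multiply a by b while skipping partial products j .. j+k-1 (eq. 2)."""
    dropped = ((1 << k) - 1) << j      # bits of b whose partial products are omitted
    return a * (b & ~dropped)


def med_brute_force(n, j, k, method1=False):
    """Mean error distance over all pairs of n-bit operands (uniform inputs)."""
    total = 0
    for a in range(1 << n):
        for b in range(1 << n):
            m, r = a, b                # multiplicand, multiplier
            if method1 and x_field(b, j, k) > x_field(a, j, k):
                m, r = b, a            # Method 1: swap when x_B > x_A
            total += m * r - perforated_product(m, r, j, k)
    return Fraction(total, 1 << (2 * n))


def med_closed_form(n, j, k):
    """Eq. (10): MED of plain perforation."""
    return Fraction((1 << j) * ((1 << k) - 1) * ((1 << n) - 1), 4)


def s1(y, n, j, k):
    """Eq. (16): sum of all n-bit numbers A whose field x_A equals y."""
    h = n - (j + k)                    # number of bits above the perforated field
    upper = (1 << j) * (1 << (j + k)) * ((1 << h) - 1) * (1 << h) // 2
    middle = (1 << h) * (1 << j) * y * (1 << j)
    lower = (1 << h) * ((1 << j) - 1) * (1 << j) // 2
    return upper + middle + lower


def med_method1_closed_form(n, j, k):
    """Eq. (21): MED with correction Method 1 (z assumed to range over [0, 2^k))."""
    acc = sum(z * z * s1(z, n, j, k) for z in range(1 << k))
    return Fraction((1 << j) * (1 << (n - k)) * acc, 1 << (2 * n))


if __name__ == "__main__":
    n = 4                              # small width so enumeration stays cheap
    for j in range(n):
        for k in range(1, min(n - j, n - 1) + 1):
            assert med_brute_force(n, j, k) == med_closed_form(n, j, k)
            assert med_brute_force(n, j, k, True) == med_method1_closed_form(n, j, k)
            nmed = med_closed_form(n, j, k) / ((1 << n) - 1) ** 2   # eq. (11)
            print(f"j={j} k={k} NMED={float(nmed):.4f}")
```

Exact rational arithmetic (`Fraction`) is used so that the comparison between enumeration and formula is an equality test rather than a floating-point tolerance check.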
The sum of all REDs is given by

$$\sum_{A,B} \mathrm{RED}(A,B) = 2^j \Bigg( \sum_{\substack{A,B: \\ x_A \ge x_B}} \frac{x_B}{B} + \sum_{\substack{A,B: \\ x_A < x_B}} \frac{x_A}{A} \Bigg) = 2^j \Bigg( \sum_{\substack{A,B: \\ x_A = x_B}} \frac{x_B}{B} + 2 \sum_{\substack{A,B: \\ x_A > x_B}} \frac{x_B}{B} \Bigg). \tag{23}$$

Denoting $C_I = 2^k - 1$ and using that

$$\sum_{\substack{A,B: \\ x_A > x_B}} \frac{x_B}{B} = \sum_{B} \sum_{\substack{A: \\ x_A > x_B}} \frac{x_B}{B} = 2^{n-k} \sum_{B} \frac{x_B}{B} (C_I - x_B) \tag{24}$$

and

$$\sum_{\substack{A,B: \\ x_A = x_B}} \frac{x_B}{B} = \sum_{B} \sum_{\substack{A: \\ x_A = x_B}} \frac{x_B}{B} = 2^{n-k} \sum_{B} \frac{x_B}{B} \tag{25}$$

(23) is equal to

$$\sum_{A,B} \mathrm{RED}(A,B) = 2^j 2^{n-k} \sum_{B} \frac{x_B}{B} \big( 1 + 2 (C_I - x_B) \big) = 2^j 2^{n-k} \sum_{B=1}^{2^n - 1} \frac{x_B}{B} \big( 1 + 2 (C_I - x_B) \big) \tag{26}$$

and MRED is calculated as a relation of j and k from

$$\mathrm{MRED} = \frac{2^j}{2^{n+k}} \sum_{B=1}^{2^n - 1} \frac{x_B}{B} \big( 1 + 2 (C_I - x_B) \big). \tag{27}$$

Method 2 (Comparing A and B): In this method, A and B are compared before the multiplication, and if A > B, A and B are swapped. As a result the induced error is $\mathrm{ED}(A,B) = A\, 2^j x_B$, when $A \le B$, and $\mathrm{ED}(A,B) = B\, 2^j x_A$, when A > B. Similar to Method 1

$$\begin{aligned} \mathrm{MED} &= \frac{2^j}{2^{2n}} \Bigg( \sum_{\substack{A,B: \\ A \le B}} x_B A + \sum_{\substack{A,B: \\ A > B}} x_A B \Bigg) \\ &= \frac{2^j}{2^{2n}} \Bigg( \sum_{\substack{A,B: \\ A = B}} x_A A + 2 \sum_{A} \sum_{\substack{B: \\ B < A}} x_A B \Bigg) \\ &= \frac{2^j}{2^{2n}} \sum_{A=1}^{2^n - 1} x_A A^2 \end{aligned} \tag{28}$$

$$\mathrm{NMED} = \frac{2^j \sum_{A=1}^{2^n - 1} x_A A^2}{2^{2n} (2^n - 1)^2} \tag{29}$$

$$\text{and} \quad \mathrm{MRED} = \frac{2^j}{2^{2n}} \sum_{B=1}^{2^n - 1} \left( \frac{x_B}{B} + 2 x_B \right). \tag{30}$$

Fig. 3 shows the error improvement achieved by Methods 1 and 2, for a 16-bit (n = 16) multiplier and all the product perforation configurations (j, k). Fig. 3(a) shows the NMED reduction attained by the correction methods with respect to the NMED of product perforation without an error correction method. Fig. 3(b) shows the respective graph for the MRED metric. The proposed corrective methods offer both NMED and MRED reduction. Method 1 offers higher NMED reduction, while Method 2 achieves higher MRED reduction. On average, Method 1 offers 30% NMED reduction and 24% MRED reduction, while Method 2 offers 26% reduction and 50% reduction, respectively. As a result, the selection of a corrective method depends on the application in which the perforated multiplier will be used. If the magnitude of the error is more important than its absolute distance from the accurate result, then Method 2 should be preferred; if not, then Method 1 should be selected. However, the implementation of Method 1 requires a k-bit comparator, while Method 2 requires an n-bit one, and thus Method 1 induces smaller area and power overheads. As a result, since both the methods offer significant NMED and MRED reductions and Method 1 induces less power overhead, it should be preferred in the case the application is unknown.

Fig. 3. Percentage reduction of (a) NMED and (b) MRED achieved by the correction Methods 1 and 2 with respect to the NMED and MRED values obtained by product perforation without correction. The x-axis contains all the [j, k] configurations.

Methods 1 and 2 decrease the error metrics, but their implementation requires an additional comparator. Fig. 4 shows the impact of correction Method 1 or Method 2 on the delay, power, and area on the Dadda 4:2 multiplier, with respect to the accurate design. Since the complexity of the comparator is mainly affected by the perforation variable k, Fig. 4 shows the perforation configurations that feature j = 1 and k = 1 to 8 (similar results are obtained for other j and for MBE designs). As expected, using Method 1 with perforation induces 13% overhead on critical delay, but also retains 26% and 20%, on average, power saving and area saving, respectively. The respective values for Method 2 are 20%, 26%, and 17%.

The NMED and MRED analytical relations show that the error imposed by the product perforation method is bounded and predictable. Therefore, when the application's input data set is determined, it can be used to calculate the optimal
combination of j and k that produces an error less than a desired upper bound.

Fig. 4. Normalized delay, power, and area metrics achieved by applying product perforation with correction and with j = 1 and k = 1 . . . 8 on a Dadda 4:2 multiplier, with respect to those of the accurate design.

Fig. 5. Flow used to evaluate the partial product perforation method on different multiplier architectures.
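As an illustration of choosing j and k under an error bound, the sketch below enumerates every valid (j, k) pair for an n-bit multiplier, keeps those whose uniform-input NMED from (11) does not exceed the bound, and returns one of them. The selection rule is an assumption made for this example only: since the text reports that larger k lowers power, the sketch uses the number of perforated partial products k as a stand-in for power savings and breaks ties by the smaller NMED. The function names are hypothetical, and a real flow would rank candidates using synthesized power and area figures and the application's own input distribution.

```python
from fractions import Fraction


def nmed_uniform(n, j, k):
    """Eq. (11): NMED of perforation [j, k] for uniformly distributed n-bit inputs."""
    return Fraction((1 << j) * ((1 << k) - 1), 4 * ((1 << n) - 1))


def best_configuration(n, nmed_bound):
    """Return (j, k, nmed) with the largest k whose NMED is within the bound.

    Assumption: k is used as a proxy for power savings; ties on k are
    resolved in favour of the smaller NMED. Returns None if no pair fits.
    """
    candidates = []
    for j in range(n):                               # j in [0, n-1]
        for k in range(1, min(n - j, n - 1) + 1):    # k in [1, min(n-j, n-1)]
            error = nmed_uniform(n, j, k)
            if error <= nmed_bound:
                candidates.append((k, -error, j))
    if not candidates:
        return None
    k, neg_error, j = max(candidates)
    return j, k, -neg_error


if __name__ == "__main__":
    print(best_configuration(16, Fraction(1, 1000)))
```

For n = 16 and a bound of 10^-3, this rule selects j = 0 and k = 8, whose NMED is 255/262140 = 1/1028; the next larger k already exceeds the bound for every j.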
IV. EXPLORING THE EFFICIENCY OF PARTIAL PRODUCT PERFORATION

In this section, the partial product perforation method is applied to various multiplier architectures in order to explore how their power consumption, area, delay, and accuracy behave, considering the perforation configuration variables j and k. This analysis aims to expose the optimal architecture-configuration pair for determined error values regarding both power dissipation and area complexity. This is critical, since different configurations may not have the same impact on a multiplier architecture, e.g., an architecture may be the power optimal one when accurate calculations are performed, but suboptimal when partial product perforation is applied.

Both the SPP and MBE techniques are considered in our analysis. Regarding the accumulation tree, the most common architectures are used: 1) array; 2) balanced delay; 3) compressor 4:2; 4) counter 7:3; 5) Dadda; 6) Dadda with 4:2 compressors; 7) redundant binary; and 8) Wallace [19], [21], [22]. The array is the simplest way to accumulate the partial products. It consists of successive carry-save adders (CSAs) and has the least complexity but the highest delay. The Wallace tree reduces to the least possible the number of partial products in each layer and is theoretically the fastest multioperand adder. However, it has very complex interconnections that do not permit practical implementations. The balanced delay tree provides a more regular routing and minimizes the number of wiring tracks. The compressor 4:2 tree also has a regular structure and sums the partial products as a binary tree does, using 4:2 compressors instead of CSAs. Unlike the Wallace tree, Dadda makes the fewest reductions needed in each layer and can achieve similar overall delay, but requires fewer gates. The Dadda tree is based on 3:2 counters (full adders) but also 2:2 counters (half-adders) to reduce the hardware complexity. The Dadda 4:2 and counter 7:3 trees use the same reduction strategy as the Dadda tree, using though 4:2 and 7:3 compressors, respectively. In the redundant binary tree, the partial products are in a redundant representation, and the addition is performed by redundant binary adders [23] in the form of a binary tree. A carry look-ahead adder is used as the final adder in all multipliers. Fig. 1 shows some typical reduction schemes of the aforementioned tree architectures and the respective perforated trees with configuration j = k = 2. Using the unit gate model² [24], the area of the array is decreased by 112 area units (au) and its delay by 8 time units (tu). The respective values for the Wallace tree are 115 au and 4 tu. The delay of the Dadda 4:2 and compressor 4:2 is not decreased, but their area decrease is 127 and 112 au, respectively.

Exploration and Analysis: The flow used for our evaluation is summarized in Fig. 5. For our analysis, 16-bit unsigned³ multiplier architectures are considered. They are implemented in structural Verilog and synthesized using Synopsys Design Compiler and the TSMC 65-nm standard cell library. We simulate the designs using Modelsim and calculate their power consumption with Synopsys PrimeTime triggering the average mode of calculation. All the possible combinations of j and k are explored, and 1376 architectural configurations are examined in total. The metrics measured for each design are the NMED, MRED, minimum delay and, at the relaxed period of 2 ns, its power consumption and area complexity. In [25], a detailed power, area, and delay characterization and the analysis of the examined perforated multiplier architectures have been performed showing that the aforementioned metrics are scaling gracefully, i.e., average slope 0.16%, 242%, and 0.03%, respectively, for increased values of k.

Fig. 6. Power-area Pareto curves for different NMED values.⁴

Since power, area, and delay metrics scale differently for each multiplier architecture when different error values are considered, we illustrate in Fig. 6 the power-area Pareto curves for different NMED values in order to distinguish the optimal designs. We consider the NMED values of $10^{-4}$, $5 \times 10^{-4}$, and $10^{-3}$, which enclose a large set of different partial product perforation configurations while keeping the error small. The optimal accurate design is the Dadda[m]. Moreover, the Dadda4:2[m]⁵ architecture appears in all

³ Applying product perforation to signed multiplication is performed similar to the unsigned one, except that we do not perforate the last partial product. Therefore, no extra circuit is needed and similar results are expected.
⁴ The respective MRED values of the designs can be derived in a straight-
forward manner from the error equations presented in Section III-B utilizing
2 Area/delay of a full adder is 7 au/4 tu, of a half adder 3 au/2 tu and of a the annotated j- and k-parameter values.
4:2 compressor 14 au/6 tu. 5 In the remainder, we consider as driver circuit the Dadda 4:2.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

8 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

Fig. 7. Box plots of power consumption for NMED < 5 104 .

curves but with different product perforation configurations


(i.e., different j and k values), depending on the NMED bound.
With respect to the accurate design, the perforation achieves
up to 50% power, 45% area, and 35% delay reductions for
only 0.1% error (i.e., NMED < 103 ).
Aiming to elucidate the impact of partial product perfora-
tion on each multiplier architecture, we examine their power
variation (i.e., the range of power values) for a bounded error.
Fig. 7 shows the box plot diagram for all the architectures
with regard to power, considering all the product perfora-
tion configurations that result in NMED < 5 104 . The
MBE-based architectures exhibit smaller variation and lower
median than the respective SPP-based ones. The lowest median
and variation values are observed for the counter 7:3[m] archi-
tecture. Thus, its power consumption for various perforation
configurations is concentrated in a smaller range, making its
Fig. 8. Comparison of partial product perforation with ACM1, ACM2 [9],
power behavior more predictable. The same conclusion is TR10, TR16, and VOS for (a) SPP and (b) MBE architectures.
confirmed in Fig. 6, where the counter 7:3[m] for NMED
values 5 104 and 103 is the Pareto optimal point with
the lowest power.
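The exploration over j and k described above can be illustrated with a small behavioral model. The following is a minimal sketch, not the authors' Verilog flow: it assumes that perforation drops the k consecutive partial products starting at index j, that partial products are indexed by the bits of the second operand, that NMED is the mean error distance normalized by the maximum product, and that MRED is averaged over nonzero exact products. The function names are hypothetical, and no error correction method is modeled.

```python
from itertools import product


def perforated_multiply(a: int, b: int, j: int, k: int) -> int:
    """Unsigned multiply that skips partial products j .. j+k-1.

    Assumption: partial product i is a * b_i * 2**i, so skipping it is
    equivalent to clearing bit i of b before multiplying.
    """
    mask = ((1 << k) - 1) << j  # bits of b whose partial products are dropped
    return a * (b & ~mask)


def error_metrics(n: int, j: int, k: int) -> tuple[float, float]:
    """Exhaustively compute (NMED, MRED) for an n-bit unsigned multiplier."""
    max_product = ((1 << n) - 1) ** 2
    total_ed = 0       # sum of error distances (exact - approximate)
    total_red = 0.0    # sum of relative error distances
    nonzero = 0        # number of input pairs with a nonzero exact product
    count = 0
    for a, b in product(range(1 << n), repeat=2):
        exact = a * b
        ed = exact - perforated_multiply(a, b, j, k)
        total_ed += ed
        count += 1
        if exact:
            total_red += ed / exact
            nonzero += 1
    nmed = total_ed / count / max_product
    mred = total_red / nonzero
    return nmed, mred


def nmed_closed_form(n: int, j: int, k: int) -> float:
    """NMED for uniform inputs, in the form that appears on both sides of (31)."""
    return (2**j * (2**k - 1)) / (4 * (2**n - 1))


if __name__ == "__main__":
    # Small 8-bit sweep so the exhaustive loop stays fast.
    for j, k in [(0, 2), (1, 3), (2, 2)]:
        nmed, mred = error_metrics(8, j, k)
        print(f"j={j} k={k} NMED={nmed:.3e} MRED={mred:.3e}")
```

An 8-bit width is used only to keep the exhaustive loop short; under the stated assumptions the simulated NMED agrees with the closed-form expression, which is the property that makes the error predictable from j and k alone.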
V. EXPERIMENTAL EVALUATION

A. Comparative Study on Circuit Level

In this section, we extensively evaluate the efficiency of partial product perforation in terms of power, area, and error, and we compare it with the state-of-the-art approximation techniques, which apply truncation [4], logic approximation [9], or the VOS technique [6]. Using the two inexact 4:2 compressors of [9] at the 16 LSB columns, two approximate 16-bit multipliers ACM1 and ACM2 are implemented in structural Verilog and synthesized at 2 ns using Synopsys Design Compiler and PrimeTime. Error metrics calculation is performed through exhaustive MATLAB simulation. In order to compare the partial product perforation with the VOS technique, we use the Synopsys composite current source (CCS) model [26]. CCS models are proven to deliver signoff-level accuracy to within 2% of the HSPICE simulation, are designed to be scalable for voltage, temperature, and process, and offer better accuracy than the nonlinear delay and power models [26]. For the exact multiplier architectures of Section IV, we scale the supply voltage from 1 (nominal) to 0.80 V and measure their power consumption and error metrics using $10^{5}$ randomly generated inputs. Regarding truncation, two truncated multipliers with variable correction [4] that use the Dadda 4:2 tree to accumulate the partial products are implemented. In the first one (TR10), 10 LSB are truncated, while in the second (TR16), the 16 ones. For the perforated multipliers, the error correction Method 1 (Section III-B2) is used.

Fig. 8 shows comparative results on the power, area, NMED, and MRED metrics after applying the four different partial product perforation configurations, the approximate compressors according to the technique presented in [9] (ACM1 and ACM2), the VOS technique, and the truncation (TR10 and TR16) on a 16-bit Dadda 4:2 multiplier using SPP [Fig. 8(a)] and MBE [Fig. 8(b)]. The examined perforated designs exhibit different orders of perforation (j variable) and they are on (designs Dadda4:2[0,8,s] and Dadda4:2[1,5,s]) or close to (designs Dadda4:2[2,2,s] and Dadda4:2[3,4,s]) the power-NMED Pareto optimal curve of the Dadda 4:2 architecture. Similar selection has been performed for the MBE-based designs.

The proposed partial product perforation for the SPP-based designs, included in Fig. 8(a), delivers power savings of up to 49% and area reduction of up to 40% compared with the respective accurate design, while the NMED value is $6.5 \times 10^{-4}$ at most and the MRED one goes up to $1.1 \times 10^{-2}$. The respective values for MBE-based configurations [Fig. 8(b)] are 47% power savings, 38% area reduction, NMED $1.8 \times 10^{-3}$, and MRED $2.5 \times 10^{-2}$. The approximate compressor multipliers ACM1 and ACM2 [9] with SPP [Fig. 8(a)] have 15% and 20% power, and 15% and 18% area savings, respectively, over the accurate Dadda 4:2 multiplier. Their NMED values are $2 \times 10^{-5}$ and $1.5 \times 10^{-5}$, while their MRED ones are $5.3 \times 10^{-3}$ and $5.6 \times 10^{-3}$, respectively. For the MBE [Fig. 8(b)], ACM1 and ACM2 have 16% and 23% power savings and 8% and 11% area reductions, respectively, over the accurate Dadda 4:2 multiplier. Their NMED values are $2.4 \times 10^{-4}$ and $1.6 \times 10^{-4}$, while their MRED ones are 17 and 24, respectively. Regarding the MBE-based designs, [9] is less efficient, since fewer partial products compared with the SPP technique are accumulated in the tree and an error occurring in one column has a greater impact on the output. VOS does not deliver any area reduction, but offers significant power savings compared with the accurate design. When decreasing the supply voltage of the SPP-based design to 0.80 V [Fig. 8(a)], the power consumption is 1.06 mW (i.e., 37.9% less than the accurate one). Similarly, the power consumption of the MBE-based design [Fig. 8(b)] is 0.94 mW (i.e., 37.7% less than the precise design). However, even for small power savings (10% at 0.95 V), the NMED and MRED values of VOS are too large, more than 0.65 and 10, respectively, as VOS errors are mainly impacting MSBs, resulting in large ED. The truncated multipliers TR10 and TR16 [4], when SPP is used, offer 14% and 46% power savings and 18% and 44% area reductions for $1.1 \times 10^{-7}$ and $1.2 \times 10^{-1}$ NMEDs and 0.4 and 0.8 MREDs, respectively. The respective values for the MBE-based designs are 15% and 44% power savings, 20% and 46% area reduction, $2 \times 10^{-5}$ and $5.0 \times 10^{-4}$ NMEDs, and 4.2 and 4.3 MREDs.

On average, the partial product perforation configurations, shown in Fig. 8, exhibit lower MRED values than ACM2, but higher NMED. The large NMED value of partial product perforation implies that it may produce large ED. However, the small value of MRED shows that such large ED is insignificant compared with the accurate result. The aforementioned points can be further explained based on the error analysis in Section III-B. As shown, the ED is proportional to the inputs, and thus it can be as large as the input numbers. However, $\mathrm{RED} = x_B 2^{j}/B$, and since a few partial products are removed, the numerator is much smaller than B, resulting in small relative error values. On the other hand, [9] produces smaller ED, but its errors are of greater significance compared with the exact results. This behavior is also captured in Fig. 9, where the PDF of the ED and RED for ACM2 and Dadda4:2[1,5,s] is presented. ACM2 exhibits lower NMED but higher MRED compared with Dadda4:2[1,5,s]. Fig. 9(a) shows the PDF of the ED for the aforementioned multipliers. ACM2 has a significantly greater error probability, but its probable error values are concentrated in a smaller range. In contrast, the Dadda4:2[1,5,s] errors are spread to a wider range and have almost equal, but very low, probability to appear. Fig. 9(b) shows the same graph for the RED metric. As shown in Fig. 9(b), ACM2[s] produces larger RED values than Dadda4:2[1,5,s] and with greater probability.

Fig. 9. (a) PDF of the ED for the ACM2 [9] and the partial product perforation Dadda 4:2 multiplier with j = 1 and k = 5. ED is in the Q0.32 number format [fixed point representation of 32-bit integers in the range [0:1)]. (b) Respective PDF of the RED.

To summarize, the partial product perforation technique shows significant gains compared with the accurate design and the state-of-the-art approximate techniques. On average, compared with VOS, the partial product perforation configurations attain 3% lower power consumption and 96% lower MRED, when SPP is used, and 9% and 99%, respectively, when MBE is used. Compared with [9] for SPP schemes, their power consumption and their MRED are 6% and 9% lower, respectively. For MBE schemes, the respective values are 17% lower power and three orders of magnitude lower MRED. Compared with the SPP truncation [4], the perforated multipliers of Fig. 8 deliver on average 3% higher power for 99% lower MRED, while for MBE, the respective values are 4% lower power and two orders of magnitude lower MRED. Finally, Table I offers a more straightforward comparison among the examined approximation schemes, by ranking them according to their savings and error metrics. The examined designs have been grouped in four subgroups, each one with designs exposing similar power and/or error characteristics. In each subgroup, the perforated multipliers deliver the lowest power and MRED values and, in most cases, the lowest NMED and area as well.

TABLE I
RANKING OF THE SAVINGS AND ERRORS OF THE APPROXIMATE MULTIPLIERS

B. Comparative Study on Real-Life Applications

In this section, we evaluate the efficiency of the proposed technique on real-life use cases from the image processing and data analytics domains. For our analysis, we consider the Canny edge detection [27] and geometric mean filters from the image processing domain and the K-means clustering [28] from the data analytics domain, respectively. All the examined
algorithms are implemented in C++, while for the image processing ones, the OpenCV library is used.

The geometric mean filter removes noise from images, offering better results than the arithmetic mean filter for Gaussian-type noise. The geometric mean filter with parameter r filters an image by replacing each pixel's value by the geometric mean of the values of all the neighboring pixels that are inside a $(2r + 1) \times (2r + 1)$ block centered on that pixel. For our evaluation, the r parameter is set to 3. We approximate the geometric mean by replacing the multiplication between the pixels with an approximate 16 × 16 multiplier. We used as input the 16-bit (16 bits/pixel) grayscale image, as shown in Fig. 10(a). To evaluate the accuracy of the output images of the geometric mean, we use the peak signal-to-noise ratio (PSNR).

The Canny edge detection [27] filter is considered to be an optimal edge detector. In particular, it masks the image by applying a Gaussian filter to remove the noise, it calculates the gradient of the image to find the edge strength, it applies a nonmaximum suppression to keep only the local maxima, it determines the potential edges by thresholding, and it tracks edges by hysteresis, i.e., suppresses all the edges that are weak and not connected to strong edges. The size of the Gaussian kernel is 7 × 7 with 1.1 standard deviation value and uses 16-bit fixed point arithmetic. We approximate Canny edge by replacing the multiplication in the Gaussian filter with an approximate 16 × 16 multiplier. We used as input the 16-bit grayscale image, shown in Fig. 11(a). The percentage of the edges detected using the approximate multiplier over those detected using the accurate one is used as our quality metric.

K-means is a popular algorithm for clustering data points from a multidimensional space into k clusters. It uses a two-phase iterative method and aims to partition the data points into sets, so as to minimize the within-cluster sum of distance functions of each point in the cluster to the center. We use the Euclidean distance as a distance function. We approximate the K-means algorithm by replacing the multiplications in the calculation of the Euclidean distance with an approximate 16 × 16 multiplier. We use a randomly generated input data set of 100 000 4-D points with 16 bits/dimension. The input data set is clustered in 100 clusters. To evaluate the accuracy of the K-means algorithm, we use the average relative L2-norm, i.e., $(\|x_{acc} - x_{approx}\|_2 / \|x_{acc}\|_2)$.

Similar to [9] and [10], the approximate multiplier is considered as part of a general processing system that implements the aforementioned algorithms. The rest of the hardware components (except the multiplier) are considered to deliver accurate results, and thus any application's inaccuracy and energy savings result from the usage of the approximate multiplier. The energy values of each multiplication operation are delivered by postsynthesis simulations of the approximate multipliers on the input data traces extracted by the application's execution. Note that in the Canny edge detection and geometric mean algorithms, the number of the multiplications depends only on the image size, and thus it is the same for the accurate as well as the approximate version of the algorithm. On the other hand, the iterations performed by the K-means algorithm are not constant, and as a result, the number of multiplications in the accurate version may differ from that in the approximate version.

Fig. 10 shows both the input image and the output image of the geometric mean filter when using the accurate multiplier Dadda4:2[s], the perforated multipliers Dadda4:2[1,5,s] and Dadda4:2[3,4,s] with and without any correction method, and the approximate multiplier ACM2. Fig. 11 shows the same images for the Canny edge detection. Table II summarizes the values of the energy savings and quality metrics of each application when using the aforementioned multipliers.

Fig. 10. (a) 16-bit input image and the result of the geometric mean filter using (b) accurate multiplier Dadda4:2[s], (c) Dadda4:2[1,5,s] without correction, (d) Dadda4:2[1,5,s] with correction Method 1, (e) Dadda4:2[3,4,s] without correction, (f) Dadda4:2[3,4,s] with correction Method 1, and (g) ACM2.

Fig. 11. (a) 16-bit input image and the result of the Canny edge detection using (b) accurate multiplier Dadda4:2[s], (c) Dadda4:2[1,5,s] without correction, (d) Dadda4:2[1,5,s] with correction Method 1, (e) Dadda4:2[3,4,s] without correction, (f) Dadda4:2[3,4,s] with correction Method 1, and (g) ACM2.

The use of the Dadda4:2[1,5,s] multiplier results in 85.95-dB PSNR for the geometric mean and 91.04% edges detected for the Canny edge detection. The application of the corrective Method 1 with the Dadda4:2[1,5,s] results in a small decrease of the energy savings (7.41%), but delivers better outputs as the PSNR increases by 2.9% and the edges detected by 7.6%. The Dadda4:2[3,4,s] multiplier detects 84.79% of the edges, and its PSNR is 89.93 dB. The use of correction Method 1 with the Dadda4:2[3,4,s] decreases the energy reduction by 10%, detects 16.6% more edges, and increases its PSNR by 3.1%. When ACM2[s] [9] is used, the output
image has 86-dB PSNR and 97.85% edges detected. When we compare Dadda4:2[1,5,s] with ACM2, we observe that the former offers 25.6% higher energy reduction, detects 7% fewer edges, and has the same PSNR as the latter. When we compare ACM2[s] with Dadda4:2[3,4,s] using Method 1, we find that the latter delivers 18.6% lower energy savings, detects 1.8% more edges, and has 7.8% higher PSNR. Finally, when we compare Dadda4:2[1,5,s] using Method 1 with ACM2[s], the former achieves 16.3% higher energy reduction, detects 0.5% more edges, and has 2.8% higher PSNR. Regarding the K-means algorithm, using a correction method with product perforation does not deliver any quality improvement. This is explained by the fact that in the Euclidean distance, the multiplier is used as a squarer, and as a result, swapping the multiplicands does not decrease the multiplication's error. Moreover, we observe that using ACM2[s] in the K-means algorithm does not offer any energy reduction. The implementation of the K-means algorithm with ACM2[s] fails to converge and exits after reaching a maximum number of allowed iterations. As a result, although ACM2[s] has lower power consumption compared with the accurate multiplier, the increased number of multiplications results in an energy increase of the K-means algorithm.

TABLE II
EVALUATION OF PARTIAL PRODUCT PERFORATION IN IMAGE PROCESSING AND DATA ANALYTICS ALGORITHMS

C. Impact of Bit-Width Scaling

In this section, we examine the scalability of the proposed technique in terms of increased multiplier's bit width. More specifically, we study the impact of scaled bit widths, i.e., 16 up to 128 bits, on the proposed perforation technique focusing on the delivered accuracy (NMED, MRED) and power and area gains. We consider the Dadda 4:2 as our driver architecture solution and NMED $\le 10^{-4}$ as our quality constraint. Fig. 12(a) shows for each of the examined bit widths the power and area reduction delivered by the perforated Dadda 4:2 solutions with respect to their accurate designs. In a complementary manner and for the same scaled bit widths, Fig. 12(b) shows the NMED and MRED values when targeting 50% power reduction. In particular, for NMED $\le 10^{-4}$, the power and area gains for 16-bit width are 21% and 31%, respectively. The respective gains in the case of the 128-bit width design scale up to 74% and 91% regarding power and area, respectively. Similarly, Fig. 12(b) shows that for the same relative power gain, i.e., 50%, the 16-bit solution delivers an NMED and MRED value of $1.95 \times 10^{-3}$ and $2.61 \times 10^{-2}$, respectively. For the 128-bit solution, NMED and MRED reduce to $1.73 \times 10^{-18}$ and $2.05 \times 10^{-16}$, respectively. Thus, the partial product perforation offers better results as the multiplier's bit width increases, i.e., higher power and area reduction for the same error constraints or lower error values for the same power savings.

Fig. 12. Impact of multiplier's bit-width scaling on partial product perforation. (a) Power and area gains for NMED $\le 10^{-4}$. (b) NMED and MRED values when targeting 50% power savings.

This good scaling behavior for increased multiplier's bit widths can also be theoretically confirmed utilizing the error analysis in Section III-B. Let us assume two multipliers $M_1$ and $M_2$ with different bit widths $n_1$ and $n_2$ with $n_1 < n_2$ having the same j-value for the partial product perforation. For both multipliers to achieve the same NMED, the following relation should hold, according to (11):

$$\frac{2^{j}(2^{k_1}-1)}{4(2^{n_1}-1)} = \frac{2^{j}(2^{k_2}-1)}{4(2^{n_2}-1)} \Rightarrow \frac{2^{n_2}-1}{2^{n_1}-1} = \frac{2^{k_2}-1}{2^{k_1}-1}. \tag{31}$$

Given that $n_1 < n_2 \Rightarrow k_1 < k_2$. High k-values imply the perforation of more partial products. Thus, for two approximate multipliers with the same NMED but different bit widths, the higher the multiplier's bit width, the higher the number of partial products that should be perforated, and thus the higher the power gains achieved with respect to their accurate counterparts.
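The scaling argument around (31) can be checked numerically. The sketch below is illustrative only: it assumes that the NMED expression on each side of (31) applies as written (uniform inputs, no error correction method), and it searches for the largest k that keeps the NMED within a bound for a given bit width n and perforation order j. The function names are hypothetical; exact rational arithmetic is used so that the 128-bit case does not suffer from floating-point rounding.

```python
from fractions import Fraction


def nmed(n: int, j: int, k: int) -> Fraction:
    """NMED in the form appearing on both sides of (31), as an exact fraction."""
    return Fraction(2**j * (2**k - 1), 4 * (2**n - 1))


def max_perforation(n: int, j: int, bound: Fraction) -> int:
    """Largest k (with j + k <= n) such that nmed(n, j, k) <= bound."""
    k = 0
    while j + k < n and nmed(n, j, k + 1) <= bound:
        k += 1
    return k


if __name__ == "__main__":
    bound = Fraction(1, 10_000)  # the NMED <= 1e-4 quality constraint
    for n in (16, 32, 64, 128):
        k = max_perforation(n, 0, bound)
        print(f"n={n:3d}: up to k={k} of {n} partial products can be perforated")
```

Under these assumptions, the admissible k grows almost one-for-one with n, so the fraction of partial products that can be dropped rises with the bit width, which is in line with the larger power and area gains reported for wider multipliers.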
VI. CONCLUSION

In this paper, we proposed the partial product perforation technique for producing approximate hardware multipliers. The proposed technique omits a number of partial products enabling high area and power savings while retaining high accuracy. Through a rigorous error analysis, we analytically characterized the induced error metrics proving that the error is bounded and predictable and we proposed two error correction methods that trade a small increase in power for high error reduction. We explored product perforation on a large set of multiplier architectures, evaluating its impact on different architectures and error bounds. In comparison to the state-of-the-art approximation techniques, we showed that the proposed approach achieves significant gains in power, area, and quality metrics of image processing and data analytics algorithms. Finally, we showed that our technique is scalable, offering better results as the multiplier's bit width increases.

REFERENCES

[1] V. K. Chippa, S. T. Chakradhar, K. Roy, and A. Raghunathan, "Analysis and characterization of inherent application resilience for approximate computing," in Proc. 50th ACM/EDAC/IEEE Design Autom. Conf., May/Jun. 2013, pp. 1–9.
[2] R. Venkatesan, A. Agarwal, K. Roy, and A. Raghunathan, "MACACO: Modeling and analysis of circuits for approximate computing," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 2011, pp. 667–673.
[3] S. T. Chakradhar and A. Raghunathan, "Best-effort computing: Re-thinking parallel software and hardware," in Proc. 47th ACM/IEEE Design Autom. Conf., Jun. 2010, pp. 865–870.
[4] E. J. King and E. E. Swartzlander, "Data-dependent truncation scheme for parallel multipliers," in Proc. Conf. Rec. 31st Asilomar Conf. Signals, Syst. Comput., Nov. 1998, pp. 1178–1182.
[5] M. J. Schulte and E. E. Swartzlander, "Truncated multiplication with correction constant," in Proc. 6th VLSI Signal Process., Oct. 1993, pp. 388–396.
[6] Y. Liu, T. Zhang, and K. K. Parhi, "Computation error analysis in digital signal processing systems with overscaled supply voltage," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 4, pp. 517–526, Apr. 2010.
[7] V. Gupta, D. Mohapatra, S. P. Park, A. Raghunathan, and K. Roy, "IMPACT: Imprecise adders for low-power approximate computing," in Proc. Int. Symp. Low Power Electron. Design, Aug. 2011, pp. 409–414.
[8] P. Kulkarni, P. Gupta, and M. Ercegovac, "Trading accuracy for power with an underdesigned multiplier architecture," in Proc. 24th Annu. Conf. VLSI Design, Jan. 2011, pp. 346–351.
[9] A. Momeni, J. Han, P. Montuschi, and F. Lombardi, "Design and analysis of approximate compressors for multiplication," IEEE Trans. Comput., vol. 64, no. 4, pp. 984–994, Apr. 2015.
[10] N. Zhu, W. L. Goh, W. Zhang, K. S. Yeo, and Z. H. Kong, "Design of low-power high-speed truncation-error-tolerant adder and its application in digital signal processing," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 8, pp. 1225–1229, Aug. 2010.
[11] A. K. Verma, P. Brisk, and P. Ienne, "Variable latency speculative addition: A new paradigm for arithmetic circuit design," in Proc. Design, Autom. Test Eur., Mar. 2008, pp. 1250–1255.
[12] S. Banescu, F. de Dinechin, B. Pasca, and R. Tudoran, "Multipliers for floating-point double precision and beyond on FPGAs," ACM SIGARCH Comput. Archit. News, vol. 38, no. 4, pp. 73–79, Jan. 2011.
[13] C. Liu, J. Han, and F. Lombardi, "A low-power, high-performance approximate multiplier with configurable partial error recovery," in Proc. Conf. Design, Autom. Test Eur., Mar. 2014, Art. no. 95.
[14] S. Sidiroglou-Douskos, S. Misailovic, H. Hoffmann, and M. Rinard, "Managing performance vs. accuracy trade-offs with loop perforation," in Proc. 19th ACM SIGSOFT Symp., 13th Eur. Conf. Found. Softw. Eng. (ESEC/FSE), Sep. 2011, pp. 124–134.
[15] J. Liang, J. Han, and F. Lombardi, "New metrics for the reliability of approximate and probabilistic adders," IEEE Trans. Comput., vol. 62, no. 9, pp. 1760–1771, Jun. 2012.
[16] A. Lingamneni, C. Enz, K. Palem, and C. Piguet, "Synthesizing parsimonious inexact circuits through probabilistic design techniques," ACM Trans. Embedded Comput. Syst., vol. 12, no. 2S, May 2013, Art. no. 93.
[17] S. Narayanamoorthy, H. A. Moghaddam, Z. Liu, T. Park, and N. S. Kim, "Energy-efficient approximate multiplication for digital signal processing and classification applications," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 6, pp. 1180–1184, Jun. 2015.
[18] S. Hashemi, R. I. Bahar, and S. Reda, "DRUM: A dynamic range unbiased multiplier for approximate applications," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 2015, pp. 418–425.
[19] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs. New York, NY, USA: Oxford Univ. Press, 2000.
[20] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, "MediaBench: A tool for evaluating and synthesizing multimedia and communications systems," in Proc. 13th Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 1997, pp. 330–335.
[21] D. Zuras and W. H. McAllister, "Balanced delay trees and combinatorial division in VLSI," IEEE J. Solid-State Circuits, vol. 21, no. 5, pp. 814–819, Oct. 1986.
[22] L. Dadda, "Some schemes for parallel multipliers," Alta Frequenza, vol. 34, no. 5, pp. 349–356, Mar. 1965.
[23] B. Jose and D. Radhakrishnan, "Delay optimized redundant binary adders," in Proc. 13th IEEE Int. Conf. Electron., Circuits Syst. (ICECS), Dec. 2006, pp. 514–517.
[24] N. Weste and D. Harris, "Datapath subsystems," in CMOS VLSI Design: A Circuits and Systems Perspective. Reading, MA, USA: Addison-Wesley, 2010.
[25] G. Zervakis, K. Tsoumanis, S. Xydis, N. Axelos, and K. Pekmestzi, "Approximate multiplier architectures through partial product perforation: Power-area tradeoffs analysis," in Proc. 25th Great Lakes Symp. VLSI, 2015, pp. 229–232.
[26] G. Mekhtarian, Composite Current Source (CCS) Modeling Technology Backgrounder. Mountain View, CA, USA: Synopsys, Inc., Nov. 2005.
[27] J. Canny, "A computational approach to edge detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-8, no. 6, pp. 679–698, Nov. 1986.
[28] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, "Evaluating MapReduce for multi-core and multiprocessor systems," in Proc. IEEE 13th Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2007, pp. 13–24.

Georgios Zervakis received the Diploma degree from the Department of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece, in 2012, where he is currently pursuing the Ph.D. degree in digital and microprocessor system design.
His current research interests include approximate computing, VLSI arithmetic circuits, low-power design, and cryptography.

Kostas Tsoumanis (S'12) received the Diploma degree from the Department of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece, in 2010, where he is currently pursuing the Ph.D. degree.
He has co-authored research papers in international conferences. His current research interests include hardware-efficient implementation of arithmetic operations and low-power design of digital signal processing algorithms.
Sotirios Xydis received the Diploma and Ph.D. degrees in electrical and computer engineering from the National Technical University of Athens, Athens, Greece, in 2005 and 2011, respectively.
He was a Post-Doctoral Research Fellow with the Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy, for two years. He is currently a Research Associate with the National Technical University of Athens. He has authored over 60 technical and research papers in scientific books, international journals, and conferences. His current research interests include design space exploration for system level and datapath synthesis, and design and optimization of arithmetic VLSI circuits and power management multi/many-core and reconfigurable architectures.
Dr. Xydis was a recipient of the two best paper awards from the NASA/ESA/IEEE International Conference on Adaptive Hardware and Systems and the Fourth Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures, in 2007 and 2013, respectively.

Dimitrios Soudris received the Diploma and Ph.D. degrees in electrical engineering from the University of Patras, Patras, Greece, in 1987 and 1992, respectively.
He was a Professor with the Department of Electrical and Computer Engineering, Democritus University of Thrace, Xanthi, Greece, from 1995 to 2008. He is currently an Associate Professor with the School of Electrical and Computer Engineering, Department of Computer Science, National Technical University of Athens, Athens, Greece. He has authored over 340 papers in international journals and conferences. He has also co-authored and co-edited seven books for Kluwer and Springer. His current research interests include embedded systems design, reconfigurable architectures, reliability, and low-power VLSI design.
Prof. Soudris received an award from INTEL and IBM for the EU project LPGD 25256, and awards in ASP-DAC 2005 and VLSI 2005 for the EU AMDREL project IST-2001-34379. He is the Leader and a Principal Investigator in numerous research projects. He served as the General Chair and Program Chair of PATMOS in 1999 and 2000, respectively, the General Chair of IFIP VLSI-SoC in 2008, and the General Co-Chair of the Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures in 2013.

Kiamal Pekmestzi received the Diploma degree in electrical engineering from the National Technical University of Athens, Athens, Greece, in 1975, and the Ph.D. degree in electrical engineering from the University of Patras, Patras, Greece, in 1981.
He was a Research Fellow with the Electronics Department, Nuclear Research Center Demokritos, Athens, from 1975 to 1981. From 1983 to 1985, he was a Professor with the Higher School of Electronics, Athens. Since 1985, he has been with the National Technical University of Athens, where he is currently a Professor with the Department of Electrical and Computer Engineering. His current research interests include efficient implementation of arithmetic operations, design of embedded and microprocessor-based systems, architectures for reconfigurable computing, VLSI implementation of cryptography, and digital signal processing algorithms.
