This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

Abstract: Approximate computing has received significant attention as a promising strategy to decrease power consumption of inherently error tolerant applications. In this paper, we focus on hardware-level approximation by introducing the partial product perforation technique for designing approximate multiplication circuits. We prove in a mathematically rigorous manner that in partial product perforation, the imposed errors are bounded and predictable, depending only on the input distribution. Through extensive experimental evaluation, we apply the partial product perforation method on different multiplier architectures and expose the optimal architecture-perforation configuration pairs for different error constraints. We show that, compared with the respective exact design, the partial product perforation delivers reductions of up to 50% in power consumption, 45% in area, and 35% in critical delay. In addition, the product perforation method is compared with the state-of-the-art approximation techniques, i.e., truncation, voltage overscaling, and logic approximation, showing that it outperforms them in terms of power dissipation and error.

Index Terms: Approximate arithmetic circuits, approximate computing, approximate multiplier, error analysis, low power.

Manuscript received September 17, 2015; revised January 4, 2016; accepted February 9, 2016. This work has been partially supported by the E.C. program AEGLE under H2020 Grant Agreement No: 644906.
The authors are with the Department of Electrical and Computer Engineering, National Technical University of Athens, Athens 15780, Greece (e-mail: zervakis@microlab.ntua.gr; kostastsoumanis@gmail.com; sxydis@microlab.ntua.gr; dsoudris@microlab.ntua.gr; pekmes@microlab.ntua.gr).
Digital Object Identifier 10.1109/TVLSI.2016.2535398
1063-8210 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

IN MODERN embedded electronic devices, power consumption is a first-class design concern. Considering that a large number of application domains are inherently tolerant to imprecise calculations, e.g., digital signal processing (DSP), data analytics, and data-mining [1], approximate computing appears as a promising solution to reduce their power dissipation. Such applications process large redundant data sets or noisy input data derived from the real world, do not have a golden result, perform statistical/probabilistic computations, and/or demand human interaction; thus, their exactness is relaxed due to limited human perception [2], [3]. Approximate computing can be applied at both software and hardware levels.

Hardware-level approximation mainly targets arithmetic units, such as adders and multipliers, widely used in portable devices to implement multimedia algorithms, e.g., image and video processing. The most commonly used techniques for the generation of approximate arithmetic circuits are truncation [4], [5], voltage overscaling (VOS) [2], [6], and simplification of logic complexity (i.e., alteration of the truth table) [7]-[9]. Extensive research has been conducted on approximate adders [6], [7], [10], [11], providing significant gains in terms of area and power while exposing small error. However, research activities on approximate multipliers are limited. Efficient approximate multipliers introduced in [8], [9], [12], and [13] target the approximation of the partial product accumulation but do not examine approximations on the partial product generation.

Approximate hardware circuits, contrary to software approximations, offer transistor reduction, lower dynamic and leakage power, lower circuit delay, and opportunity for downsizing. Motivated by the limited research on approximate multipliers, compared with the extensive research on approximate adders, and explicitly the lack of approximate techniques targeting the partial product generation, we introduce the partial product perforation method for creating approximate multipliers. Inspired by [14], we omit the generation of some partial products, thus reducing the number of partial products that have to be accumulated and decreasing the area, power, and depth of the accumulation tree. The major contributions of this paper are summarized as follows.

1) We adopt and apply, for the first time, the software-based perforation technique [14] on the design of hardware circuits, obtaining the optimized design solutions regarding the power-area-error tradeoffs.
2) We analyze in a mathematically rigorous manner the arithmetic accuracy of partial product perforation and prove that it delivers a bounded and predictable output error. Our error analysis is not bound to a specific multiplier architecture and can be applied with error guarantees to every multiplication circuit regardless of its architecture. Such a rigorous analysis enables precise error estimation over input data distributions.
3) We explore and characterize the efficiency of the product perforation method on several multiplier schemes, exposing its power-area impact on different architectures. This is the first time that such an exploratory analysis over different approximate multiplier architectures is offered to the designer, enabling also the selection of the optimum architecture-perforation configuration for given error constraints.
4) We show that the partial product perforation outperforms the related state-of-the-art works in terms of power consumption and error, as well as output quality,
when applied to image processing and data analytics algorithms.

More specifically, we apply the partial product perforation on 16 different multiplier architectures using industrial strength tools, i.e., Synopsys Design Compiler and PrimeTime. Through extensive experimental evaluation, we present the optimal approximate multiplier configurations for various error constraints. We show that, compared with the accurate multiplier, the product perforation offers reductions of up to 50% in power consumption, 45% in area, and 35% in critical delay for 0.1% normalized mean error distance (NMED) [15]. Moreover, it is compared with the state-of-the-art approximate computing works that use either VOS [6], logic approximation [9], or truncation [4], outperforming them significantly in terms of power dissipation and error. Finally, we examine the scalability of our technique by applying it on different bit-width multipliers and show that the delivered savings increase with the width increase.

The rest of this paper is organized as follows. In Section II, we discuss the related literature with an emphasis on circuit-level approximation. Section III introduces the partial product perforation technique, providing the corresponding error analysis and error correction methods. In Section IV, we examine the product perforation on different multiplier architectures, exposing the optimal architecture-perforation configuration pairs under differing error constraints. Section V evaluates the product perforation method by comparing it with the related state-of-the-art works. Finally, the conclusion is drawn in Section VI.

II. RELATED WORK

In this section, the related research in the field of hardware approximate computing is discussed. Both general-purpose approximation techniques [4], [6], [16] applied to any arithmetic circuit and circuit-specific approximation either to adder [7], [10], [11] or multiplier designs [8], [9], [13], [17], [18] have been presented.

Regarding the general approximation techniques, VOS [2], [6] and truncation [4], [5], [12] have been proposed. VOS is applied in any circuit by lowering the supply voltage below its nominal value. Decreasing the supply voltage reduces the circuit's power consumption, but produces errors caused by the number of paths that fail to meet the delay constraints [2]. Banescu et al. [12] proposed an automated generation of large precision floating-point multipliers in field-programmable gate arrays using sophisticated truncation over underutilized DSPs. In [5], a truncated multiplier with a constant correction term is proposed, significantly decreasing the error imposed by typical truncation. King and Swartzlander [4] proposed a truncated multiplier with variable correction that outperforms [5] in terms of error. Probabilistic pruning and logic minimization techniques have been presented in [16] using a greedy approach to generate approximate circuits. These techniques systematically eliminate the circuit's components and simplify logic complexity according to the circuit's activity profile and output significance. Both the techniques heavily depend on the application's characteristics, and in addition, the induced approximation error is not rigorously bounded.

Extensive research has been conducted targeting the implementation of approximate adders [7], [10], [11]. Verma et al. [11] developed a probability proof, estimating that the longest carry chain in an n-bit adder is log n, and produced a fast inexact adder limiting the carry propagation. In [10], approximation is performed by decomposing the addition circuit in an accurate and an approximate inaccurate part. Gupta et al. [7] build imprecise full adder cells, requiring fewer transistors, by approximating their logic function and then use them to build imprecise adders. Although it is proposed to use such adders targeting to build approximate multipliers, it is not clear how they can be used in different tree architectures and how their error scales in the case of multioperand addition. Targeting the creation of approximate multipliers, Kulkarni et al. [8] proposed a simplified imprecise 2 × 2 multiplier cell used as the basic block for constructing larger multiplier architectures. Momeni et al. [9] presented two approximate 4:2 compressors by modifying the respective accurate truth table, which were then used to build two approximate multipliers outperforming [8]. The approximate compressors of [9] are used in Dadda tree with 4:2 reduction. However, different multiplier architectures were not explored. Based on an approximate adder that limits the carry propagation, Liu et al. [13] presented a fast and low-power multiplier scheme with higher error than [9]. However, in all the aforementioned approaches, the imposed error cannot be predicted, as it depends on carry propagation and the circuit's implementation, and requires simulations over all possible inputs in order to be calculated.

Recently, Narayanamoorthy et al. [17] and Hashemi et al. [18] proposed the use of m × m multipliers to perform an n × n multiplication (with m < n). Narayanamoorthy et al. [17] statically split the multiplicand in three m-bit segments and perform the multiplication utilizing the segment containing the most significant 1 (leading one). However, as stated in [18], m needs to be at least n/2 to attain acceptable accuracy, thus limiting the energy savings and the scalability of this approach. Hashemi et al. [18] extended the idea of leading-one segments to enable dynamic range multiplication and added a correction term. Although [18] delivers higher accuracy designs than [17] using smaller values for m, its approach requires the allocation of extra complex circuitry, i.e., two leading-one detectors, two complex multiplexers for segment selection, one log(n)-bit comparator, a log(n)-bit adder, and one 2n-bit barrel shifter. These extra components are expected to highly increase the circuit's complexity, introducing nontrivial delay, area, and energy overheads that may considerably decrease the approximation benefits [17]. This is expected to be more evident in designs targeting too small error values, in which the need for larger m values is required.

In this paper, we target the design of power-error efficient multiplication circuits. We differ from the previous works by exploring approximation on the generation of the partial products. The proposed method can be easily applied in any multiplier architecture without the need for a special design,
in contrast to related works. In addition, the error imposed by perforation depends only on the configuration parameters and, in contrast to existing work, can be analytically calculated without the need for exhaustive simulations. The latter is critical, as, given the application's inputs, a precise estimation of the output quality can be extracted. Finally, the knowledge of the induced error permits the selection of the configuration that maximizes the power savings for a specific error bound.

III. ANALYZING PARTIAL PRODUCT PERFORATION

A. Method Analysis

In this section, the partial product perforation method for the design of approximate hardware multipliers is described. Consider two n-bit numbers A and B. The result of their multiplication $A \times B$ is obtained after summing all the partial products $A b_i$, where $b_i$ is the $i$th bit of B. Thus

$$A \times B = \sum_{i=0}^{n-1} A b_i 2^i, \quad b_i \in \{0, 1\}. \tag{1}$$

The partial product perforation technique omits the generation of k successive partial products starting from the jth one. A perforated partial product is not inserted in the accumulation tree, and hence n full adders can be eliminated. Applying the product perforation with j and k configuration values on the multiplication, $A \times B$ produces the approximate result

$$A \times B|_{j,k} = \sum_{\substack{i=0 \\ i \notin [j, j+k)}}^{n-1} A b_i 2^i, \quad b_i \in \{0, 1\}. \tag{2}$$

Note that $j \in [0, n-1]$ and $k \in [1, \min(n-j, n-1)]$.

Similarly, when modified booth encoding (MBE) [19] is used for generating the partial products, the result of the approximate multiplication is given by

$$A \times B|_{j,k} = \sum_{\substack{i=0 \\ i \notin [j, j+k)}}^{n/2-1} A b_i^{MB} 4^i, \quad b_i^{MB} \in \{0, \pm 1, \pm 2\}. \tag{3}$$

Fig. 1 shows an example of applying the partial product perforation method on different 8-bit multipliers with j = 2 and k = 2 configuration values. For each architecture, the dot diagrams [19] of the accurate and the respective perforated tree are presented. The dots represent the bits of the partial products that have to be accumulated, while the stages represent the delay of the reduction process followed by each tree. The dashed boxes with four dots are 4:2 compressors, those with three are full adders and those with two are either full- or half-adders. Through the proposed approximation technique, the power, area, and delay of the multiplication circuit are decreased, making, though, the computation imprecise. The higher the order of a perforated partial product, the greater the error imposed at the final result. In addition, since the addition is an associative and commutative operation, when more than one partial products are perforated, the total error results from the addition of the errors produced from the perforation of each partial product separately.

Fig. 1. Partial product reduction process for 8 × 8 multiplication with (a) accurate array, (b) approximate array, (c) accurate Wallace, (d) approximate Wallace, (e) accurate compressor 4:2, (f) approximate compressor 4:2, (g) accurate Dadda 4:2, and (h) approximate Dadda 4:2. Approximation is performed by perforating the third and fourth partial products. The boxes with four dots are 4:2 compressors, those with three are full adders and those with two are full- or half-adders.

We use the notation D[j,k,c] to label the different approximate multiplier architectural configurations. The parameter D refers to the tree architecture, j is the order of the first perforated partial product, and k is the number of the perforated partial products. If no j and k are specified, the respective notation refers to the exact design. Finally, c corresponds to the partial product generation technique and takes the value s for simple partial products (SPPs) or m for MBE. For example, Fig. 1(a) shows the array[s] configuration, while Fig. 1(b) shows the array[2,2,s] configuration.

The partial product perforation should not be confused with the truncation technique. Truncation eliminates the circuit
that produces specific least significant bits (LSBs) of the accumulation tree, while the perforation skips the generation of partial products and thus decreases the number of operands to be accumulated. For example, in an 8-bit array multiplier, perforating a partial product removes eight full adders from the accumulation tree and reduces its delay. In order to attain similar circuit reduction using truncation, 6 LSB have to be truncated. However, truncating 6 LSB does not offer any delay reduction. Moreover, in this example, the truncation delivers, in all the cases, incorrect results, whereas the outputs of perforation are 50% correct. Finally, perforating one partial product (out of eight) results in a 12.5% loss of information while truncating 6 LSB (out of 16) results in a 37.5% information loss. In Section V, the perforation and truncation techniques are quantitatively compared in greater detail regarding error and power metrics, in order to further expose their differences.

B. Error Analysis

A critical issue for the approximate computing is the error imposed during computations and how it affects the final result. In this section, an error evaluation analysis of the partial product perforation technique is presented. We evaluate the induced error metrics proposed in [15], i.e., ED, MED, and NMED, as effective metrics for quantifying the accuracy of approximate arithmetic circuits. ED is defined as the absolute distance of the fully accurate product $P$ and the approximate one $P'$, $\mathrm{ED} = |P - P'|$. The MED is the average of EDs for all inputs and $\mathrm{NMED} = \mathrm{MED}/P_{\max}$, where $P_{\max} = (2^n - 1)^2$ in the case of an n-bit multiplier [13]. The relative error distance (RED) is defined as $\mathrm{RED} = \mathrm{ED}/P$, and the mean RED (MRED) is similarly obtained [13].

1) Error Evaluation: When applying the product perforation on an n-bit multiplier using SPP generation, the ED of multiplying two numbers A and B is calculated as follows:

$$\mathrm{ED}(A, B) = |P - P'| = A \sum_{i=0}^{n-1} b_i 2^i - A \sum_{\substack{i=0 \\ i \notin [j, j+k)}}^{n-1} b_i 2^i = A \sum_{i=j}^{j+k-1} 2^i b_i = A 2^j x_B \tag{4}$$

where $x_B \in [0, 2^k)$ and

$$x_B = \sum_{i=0}^{k-1} 2^i b_{j+i} = \lfloor B/2^j \rfloor \bmod 2^k. \tag{5}$$

If $p_A$ and $p_B$ are the probability density functions (PDFs) of A and B, respectively, then the MED is calculated from

$$\mathrm{MED} = \sum_{A,B} p_A(A)\, p_B(B)\, \mathrm{ED}(A, B). \tag{6}$$

Without loss of generality, the rest of our analysis considers a uniform distribution over the overall n-bit numbers, i.e., $(A, B) \in [0, 2^n)^2$. Hence, $p_A(A) = 1/2^n \ \forall A$ and $p_B(B) = 1/2^n \ \forall B$. Therefore, MED is given from

$$\mathrm{MED} = \sum_{A,B} \frac{\mathrm{ED}(A, B)}{2^n 2^n} = \frac{1}{2^{2n}} \sum_{A} \sum_{B} \mathrm{ED}(A, B). \tag{7}$$

Assuming that $\mathrm{ED}_A$ is the sum of EDs $\forall B$ for a given A, we have

$$\mathrm{ED}_A = \sum_{B} \mathrm{ED}(A, B) = 2^{n-k} \sum_{x_B} x_B 2^j A = \frac{2^n 2^j (2^k - 1) A}{2} \tag{8}$$

and the sum of all EDs is

$$\sum_{A} \mathrm{ED}_A = \sum_{A} \frac{2^n 2^j (2^k - 1) A}{2} = \frac{2^n 2^j (2^k - 1)}{2} \sum_{A=0}^{2^n - 1} A = \frac{2^j 2^{2n} (2^k - 1)(2^n - 1)}{4}. \tag{9}$$

Using (9), (7) equals

$$\mathrm{MED} = \frac{2^j 2^{2n} (2^k - 1)(2^n - 1)}{2^{2n} \cdot 4} = \frac{2^j (2^k - 1)(2^n - 1)}{4}. \tag{10}$$

Thus

$$\mathrm{NMED} = \frac{\mathrm{MED}}{(2^n - 1)^2} = \frac{2^j (2^k - 1)}{4 (2^n - 1)}. \tag{11}$$

Similarly

$$\mathrm{RED}(A, B) = \frac{\mathrm{ED}(A, B)}{AB} = \frac{x_B 2^j}{B} \tag{12}$$

and

$$\mathrm{MRED} = \frac{2^n}{2^{2n}} \sum_{B} \frac{x_B 2^j}{B} = \frac{1}{2^n} \sum_{B} \frac{x_B 2^j}{B}. \tag{13}$$

The previous analysis provides rigorous expressions of error metrics, enabling a fast error analysis of differing product perforation configurations. As shown in Section IV, these analytical error expressions are used in an exploration loop for deriving optimized approximate design solutions. The analytical equations (11) and (13) consider uniform distribution; thus in the case of differing distributions,¹ they should be adjusted according to the new PDFs, since the power-error efficiency of approximate designs highly depends on the multiplier's operands distribution. In most applications, e.g., multimedia, the inputs are highly correlated [16]. As an intuitive example, Fig. 2(a) shows the power-NMED Pareto graph for a 16-bit Dadda 4:2 multiplier when A and B follow the uniform distribution over the overall range of n-bit numbers, while Fig. 2(b) shows the same graph with inputs derived from the GSM 06.10 audio benchmark [20]. As shown, increasing the k-values results in lower power consumption but increased error values, while the selection of the j-value mostly depends on the input distribution. Intuitively, for a uniform distribution over all possible n-bit numbers [Fig. 2(a)], where all the bits have equal probability of being one or zero, j should be kept small to minimize the error. This is also confirmed from Fig. 2(a), where 58% of the Pareto configurations feature j = 0 and 42% of the Pareto configurations feature j = 1. However, as shown in Fig. 2(b), when the inputs are correlated

¹ In the case of different input distributions, starting from (6), we apply the same steps given the respective PDFs of the input operands.
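The closed-form expressions in (4), (5), and (11) can be checked numerically for small bit widths. The following is a minimal sketch, assuming SPP generation, uniformly distributed operands, and $j + k \le n$; the function names are hypothetical and not taken from the paper. It models the perforated product of (2) in software and compares an exhaustive NMED with the value given by (11).

```python
def perforated_multiply(a: int, b: int, j: int, k: int) -> int:
    """Approximate a*b by skipping partial products j .. j+k-1 of b, as in (2)."""
    mask = ((1 << k) - 1) << j       # bits of b whose partial products are perforated
    return a * (b & ~mask)


def error_distance(a: int, b: int, j: int, k: int) -> int:
    """ED(A, B) = A * 2^j * x_B, following (4) and (5)."""
    x_b = (b >> j) & ((1 << k) - 1)  # x_B = floor(B / 2^j) mod 2^k
    return a * (x_b << j)


def nmed_brute_force(n: int, j: int, k: int) -> float:
    """NMED under uniform inputs, summing ED over all 2^(2n) operand pairs."""
    size = 1 << n
    total = sum(a * b - perforated_multiply(a, b, j, k)
                for a in range(size) for b in range(size))
    med = total / size ** 2          # equation (7)
    return med / (size - 1) ** 2     # normalize by P_max = (2^n - 1)^2


def nmed_closed_form(n: int, j: int, k: int) -> float:
    """NMED from (11): 2^j (2^k - 1) / (4 (2^n - 1))."""
    return 2 ** j * (2 ** k - 1) / (4 * (2 ** n - 1))


if __name__ == "__main__":
    n, j, k = 6, 1, 2                # small width so the exhaustive loop stays cheap
    print("brute force :", nmed_brute_force(n, j, k))
    print("closed form :", nmed_closed_form(n, j, k))
```

The exhaustive loop grows as $2^{2n}$, so it is only practical for small n; this is the cost that the analytical expressions avoid. For a nonuniform input distribution, the same loop could weight each term by $p_A(A)\,p_B(B)$ as in (6).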
$$\sum_{\substack{A,B: \\ x_A > x_B}} \frac{x_B}{B} = \sum_{B} \sum_{\substack{A: \\ x_A > x_B}} \frac{x_B}{B} = 2^{n-k} \sum_{B} (C_I - x_B) \frac{x_B}{B} \tag{24}$$

and

$$\sum_{\substack{A,B: \\ x_A = x_B}} \frac{x_B}{B} = \sum_{B} \sum_{\substack{A: \\ x_A = x_B}} \frac{x_B}{B} = 2^{n-k} \sum_{B} \frac{x_B}{B} \tag{25}$$

(23) is equal to

$$\sum_{A,B} \mathrm{RED}(A, B) = 2^j 2^{n-k} \sum_{B} \bigl(1 + 2(C_I - x_B)\bigr) \frac{x_B}{B} = 2^j 2^{n-k} \sum_{B=1}^{2^n - 1} \bigl(1 + 2(C_I - x_B)\bigr) \frac{x_B}{B} \tag{26}$$

and MRED is calculated as a relation of j and k from

$$\mathrm{MRED} = \frac{2^j}{2^{n+k}} \sum_{B=1}^{2^n - 1} \bigl(1 + 2(C_I - x_B)\bigr) \frac{x_B}{B}. \tag{27}$$

Method 2 (Comparing A and B): In this method, A and B are compared before the multiplication, and if A > B, A and B are swapped. As a result the induced error $\mathrm{ED}(A, B) = A 2^j x_B$, when $A \le B$ and $\mathrm{ED}(A, B) = B 2^j x_A$, when $A > B$. Similar to Method 1

$$\mathrm{MED} = \frac{2^j}{2^{2n}} \Biggl( \sum_{\substack{A,B: \\ A \le B}} x_B A + \sum_{\substack{A,B: \\ A > B}} x_A B \Biggr) = \frac{2^j}{2^{2n}} \Biggl( \sum_{\substack{A,B: \\ A = B}} x_A A + 2 \sum_{A} \sum_{\substack{B: \\ B < A}} x_A B \Biggr) = \frac{2^j}{2^{2n}} \sum_{A=1}^{2^n - 1} x_A A^2 \tag{28}$$

$$\mathrm{NMED} = \frac{2^j \sum_{A=1}^{2^n - 1} x_A A^2}{2^{2n} (2^n - 1)^2} \tag{29}$$

and

$$\mathrm{MRED} = \frac{2^j}{2^{2n}} \sum_{B=1}^{2^n - 1} \left( \frac{x_B}{B} + 2 x_B \right). \tag{30}$$

Fig. 3 shows the error improvement achieved by Methods 1 and 2, for a 16-bit (n = 16) multiplier and all the product perforation configurations (j, k). Fig. 3(a) shows the NMED reduction attained by the correction methods with respect to the NMED of product perforation without an error correction method. Fig. 3(b) shows the respective graph for the MRED metric. The proposed corrective methods offer both NMED and MRED reduction. Method 1 offers higher NMED reduction, while Method 2 achieves higher MRED reduction. On average, Method 1 offers 30% NMED reduction and 24% MRED reduction, while Method 2 offers 26% reduction and 50% reduction, respectively. As a result, the selection of a corrective method depends on the application in which the perforated multiplier will be used. If the magnitude of the error is more important than its absolute distance from the accurate result, then Method 2 should be preferred; if not, then Method 1 should be selected. However, the implementation of Method 1 requires a k-bit comparator, while Method 2 requires an n-bit one, and thus Method 1 induces smaller area and power overheads. As a result, since both the methods offer significant NMED and MRED reductions and Method 1 induces less power overhead, it should be preferred in the case the application is unknown.

Fig. 3. Percentage reduction of (a) NMED and (b) MRED achieved by the correction Methods 1 and 2 with respect to the NMED and MRED values obtained by product perforation without correction. The x-axis contains all the [j, k] configurations.

Methods 1 and 2 decrease the error metrics, but their implementation requires an additional comparator. Fig. 4 shows the impact of correction Method 1 or Method 2 on the delay, power, and area on the Dadda 4:2 multiplier, with respect to the accurate design. Since the complexity of the comparator is mainly affected by the perforation variable k, Fig. 4 shows the perforation configurations that feature j = 1 and k = 1 to 8 (similar results are obtained for other j and for MBE designs). As expected, using Method 1 with perforation induces 13% overhead on critical delay, but also retains 26% and 20%, on average, power saving and area saving, respectively. The respective values for Method 2 are 20%, 26%, and 17%.

The NMED and MRED analytical relations show that the error imposed by the product perforation method is bounded and predictable. Therefore, when the application's input data set is determined, it can be used to calculate the optimal
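The Method 2 expression in (28) can be sanity-checked in the same exhaustive way for small n. The sketch below is illustrative only: it models the operand swap in software, assumes SPP generation and uniform inputs, and uses hypothetical function names that do not come from the paper.

```python
def segment(v: int, j: int, k: int) -> int:
    """x_V = floor(V / 2^j) mod 2^k, i.e., the k bits of V that get perforated."""
    return (v >> j) & ((1 << k) - 1)


def ed_method2(a: int, b: int, j: int, k: int) -> int:
    """ED with Method 2: the operands are swapped when A > B before perforating."""
    if a > b:
        a, b = b, a
    return a * (segment(b, j, k) << j)   # A * 2^j * x_B with A <= B


def med_method2_brute_force(n: int, j: int, k: int) -> float:
    """MED under uniform inputs, summing the Method 2 ED over all operand pairs."""
    size = 1 << n
    total = sum(ed_method2(a, b, j, k) for a in range(size) for b in range(size))
    return total / size ** 2


def med_method2_closed_form(n: int, j: int, k: int) -> float:
    """MED from (28): 2^j / 2^(2n) * sum over A of x_A * A^2."""
    size = 1 << n
    return 2 ** j * sum(segment(a, j, k) * a * a for a in range(1, size)) / size ** 2


def med_uncorrected(n: int, j: int, k: int) -> float:
    """MED of plain perforation without correction, from (10)."""
    return 2 ** j * (2 ** k - 1) * (2 ** n - 1) / 4


if __name__ == "__main__":
    n, j, k = 4, 0, 1
    print("Method 2, brute force :", med_method2_brute_force(n, j, k))
    print("Method 2, closed form :", med_method2_closed_form(n, j, k))
    print("no correction         :", med_uncorrected(n, j, k))
```

Unlike (10), the sum in (28) still ranges over all $2^n$ values of A, so evaluating it costs $O(2^n)$ operations rather than the $O(2^{2n})$ of the exhaustive double loop.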
Fig. 10. (a) 16-bit input image and the result of the geometric mean filter using (b) accurate multiplier Dadda4:2[s], (c) Dadda4:2[1,5,s] without correction, (d) Dadda4:2[1,5,s] with correction Method 1, (e) Dadda4:2[3,4,s] without correction, (f) Dadda4:2[3,4,s] with correction Method 1, and (g) ACM2.

Fig. 11. (a) 16-bit input image and the result of the Canny edge detection using (b) accurate multiplier Dadda4:2[s], (c) Dadda4:2[1,5,s] without correction, (d) Dadda4:2[1,5,s] with correction Method 1, (e) Dadda4:2[3,4,s] without correction, (f) Dadda4:2[3,4,s] with correction Method 1, and (g) ACM2.

algorithms are implemented in C++, while for the image processing ones, OpenCV library is used.

Geometric mean filter removes noise from images, offering better results than the arithmetic mean filter for Gaussian-type noise. The geometric mean filter with parameter r filters an image by replacing each pixel's value by the geometric mean of the values of all the neighboring pixels that are inside a (2r + 1) × (2r + 1) block centered on that pixel. For our evaluation, the r parameter is set to 3. We approximate the geometric mean by replacing the multiplication between the pixels with an approximate 16 × 16 multiplier. We used as input the 16-bit (16 bits/pixel) grayscale image, as shown in Fig. 10(a). To evaluate the accuracy of the output images of the geometric mean, we use the peak signal-noise ratio (PSNR).

Canny edge detection [27] filter is considered to be an optimal edge detector. In particular, it masks the image by applying a Gaussian filter to remove the noise, it calculates the gradient of the image to find the edge strength, it applies a nonmaximum suppression to keep only the local maxima, it determines the potential edges by thresholding, and it tracks edges by hysteresis, i.e., suppresses all the edges that are weak and not connected to strong edges. The size of the Gaussian kernel is 7 × 7 with 1.1 standard deviation value and uses 16-bit fixed point arithmetic. We approximate Canny edge by replacing the multiplication in the Gaussian filter with an approximate 16 × 16 multiplier. We used as input the 16-bit grayscale image, shown in Fig. 11(a). The percentage of the edges detected using the approximate multiplier over those detected using the accurate one is used as our quality metric.

K-means is a popular algorithm for clustering data points from a multidimensional space into k clusters. It uses a two-phase iterative method and aims to partition the data points into sets, so as to minimize the within-cluster sum of distance functions of each point in the cluster to the center. We use the Euclidean distance as a distance function. We approximate the K-means algorithm by replacing the multiplications in the calculation of the Euclidean distance with an approximate 16 × 16 multiplier. We use a randomly generated input data set of 100 000 4-D points with 16 bits/dimension. The input data set is clustered in 100 clusters. To evaluate the accuracy of the K-means algorithm, we use the average relative L2-norm, i.e., $(\lVert x_{acc} - x_{approx} \rVert_2 / \lVert x_{acc} \rVert_2)$.

Similar to [9] and [10], the approximate multiplier is considered as part of a general processing system that implements the aforementioned algorithms. The rest of hardware components (except the multiplier) are considered to deliver accurate results, and thus any application's inaccuracy and energy savings result from the usage of the approximate multiplier. The energy values of each multiplication operation are delivered by postsynthesis simulations of the approximate multipliers on the input data traces extracted by the application's execution. Note that in the Canny edge detection and geometric mean algorithms, the number of the multiplications depends only on the image size, and thus it is the same for the accurate as well as the approximate version of the algorithm. On the other hand, the iterations performed by the K-means algorithm are not constant, and as a result, the number of multiplications in the accurate may differ from the ones in the approximate version.

Fig. 10 shows both the input image and the output image of the geometric mean filter when using the accurate multiplier Dadda4:2[s], the perforated multipliers Dadda4:2[1,5,s] and Dadda4:2[3,4,s] with and without any correction method and the approximate multiplier ACM2. Fig. 11 shows the same images for the Canny edge detection. Table II summarizes the values of the energy savings and quality metrics of each application when using the aforementioned multipliers.

The use of the Dadda4:2[1,5,s] multiplier results in 85.95-dB PSNR for the geometric mean and 91.04% edges detected for the Canny edge detection. The application of the corrective Method 1 with the Dadda4:2[1,5,s] results in a small decrease of the energy savings (7.41%), but delivers better outputs as the PSNR increases by 2.9% and the edges detected by 7.6%. The Dadda4:2[3,4,s] multiplier detects the 84.79% of the edges, and its PSNR is 89.93 dB. The use of correction Method 1 with the Dadda4:2[3,4,s] decreases the energy reduction by 10%, detects 16.6% more edges, and increases its PSNR by 3.1%. When ACM2[s] [9] is used, the output
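As a rough illustration of how an approximate multiplier can be substituted into the Euclidean distance used by K-means, and how a relative L2-norm can then be evaluated, consider the sketch below. It is a software model under stated assumptions, not the authors' C++ implementation: the perforated multiplier uses SPP generation without any correction method, all function names are hypothetical, and how the relative L2-norm is aggregated over clusters in the paper is not shown here.

```python
import math


def perforated_multiply(a: int, b: int, j: int, k: int) -> int:
    """Software model of an SPP multiplier with partial products j .. j+k-1 perforated."""
    mask = ((1 << k) - 1) << j
    return a * (b & ~mask)


def squared_distance(p, q, mul):
    """Squared Euclidean distance where every multiplication goes through `mul`."""
    return sum(mul(abs(x - y), abs(x - y)) for x, y in zip(p, q))


def relative_l2(x_acc, x_approx):
    """Relative L2-norm ||x_acc - x_approx||_2 / ||x_acc||_2 of two vectors."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_acc, x_approx)))
    den = math.sqrt(sum(a * a for a in x_acc))
    return num / den


if __name__ == "__main__":
    point, center = (1200, 40000, 7, 65535), (900, 38000, 300, 60000)
    exact = squared_distance(point, center, lambda a, b: a * b)
    # Hypothetical configuration [1,5,s]: perforate five partial products starting at j = 1.
    approx = squared_distance(point, center, lambda a, b: perforated_multiply(a, b, 1, 5))
    print("exact distance^2      :", exact)
    print("approximate distance^2:", approx)
```

Passing the multiplier as a function keeps the distance code unchanged between the accurate and approximate runs, which mirrors the evaluation setup in which only the multiplier is replaced.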
TABLE II
EVALUATION OF PARTIAL PRODUCT PERFORATION IN IMAGE PROCESSING AND DATA ANALYTICS ALGORITHMS
Sotirios Xydis received the Diploma and Ph.D. degrees in electrical and computer engineering from the National Technical University of Athens, Athens, Greece, in 2005 and 2011, respectively.
He was a Post-Doctoral Research Fellow with the Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy, for two years. He is currently a Research Associate with the National Technical University of Athens. He has authored over 60 technical and research papers in scientific books, international journals, and conferences. His current research interests include design space exploration for system level and datapath synthesis, and design and optimization of arithmetic VLSI circuits and power management multi/many-core and reconfigurable architectures.
Dr. Xydis was a recipient of the two best paper awards from the NASA/ESA/IEEE International Conference on Adaptive Hardware and Systems and the Fourth Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures, in 2007 and 2013, respectively.

Kiamal Pekmestzi received the Diploma degree in electrical engineering from the National Technical University of Athens, Athens, Greece, in 1975, and the Ph.D. degree in electrical engineering from the University of Patras, Patras, Greece, in 1981.
He was a Research Fellow with the Electronics Department, Nuclear Research Center Demokritos, Athens, from 1975 to 1981. From 1983 to 1985, he was a Professor with the Higher School of Electronics, Athens. Since 1985, he has been with the National Technical University of Athens, where he is currently a Professor with the Department of Electrical and Computer Engineering. His current research interests include efficient implementation of arithmetic operations, design of embedded and microprocessor-based systems, architectures for reconfigurable computing, VLSI implementation of cryptography, and digital signal processing algorithms.