Vedic division methodology for high-speed very large scale integration applications
Prabir Saha (1), Deepak Kumar (2), Partha Bhattacharyya (3), Anup Dandapat (1)

(1) Department of Electronics and Communication Engineering, National Institute of Technology, Shillong, Meghalaya 793 003, India
(2) Department of Computer Science and Engineering, National Institute of Technology, Shillong, Meghalaya 793 003, India
(3) Department of Electronics and Telecommunication Engineering, Bengal Engineering and Science University, Shibpur, Howrah 711 103, India
E-mail: anup.dandapat@gmail.com
Published in The Journal of Engineering; Received on 19th November 2013; Accepted on 7th January 2014
Abstract: A transistor-level implementation of a division methodology based on ancient Vedic mathematics is reported in this Letter. The Dhvajanka ('on top of the flag') formula was adopted from Vedic mathematics to implement a divider for practical very large scale integration applications. The division was implemented through half of the divisor bits instead of the actual divisor, together with subtraction and a few small multiplications. Propagation delay and dynamic power consumption of the divider circuitry were reduced significantly by the stage reduction that the Vedic division methodology offers. The functionality of the division algorithm was checked, and performance parameters such as propagation delay and dynamic power consumption were calculated through spice spectre with 90 nm complementary metal oxide semiconductor technology. The propagation delay of the resulting (32 ÷ 16) bit divider circuitry was only about 300 ns, and it consumed about 32.5 mW for a layout area of 17.39 mm². By combining Boolean arithmetic with ancient Vedic mathematics, a substantial number of iterations were eliminated, giving 47, 38 and 34% reductions in delay and 34, 21 and 18% reductions in power compared with the most commonly used architectures (digit-recurrence, Newton-Raphson and Goldschmidt, respectively).
1 Introduction

Division is a fundamental operation in many scientific and engineering applications, such as arithmetic computation, signal processing, artificial intelligence and computer graphics [1–3]. Generally, such division operations are computed sequentially and are therefore costlier in terms of propagation delay (latency) than other mathematical operations such as addition, subtraction and multiplication [4].

A substantial amount of work has been carried out by various researchers to implement high-speed dividers [1–15], including the digit recurrence (DR) methodology (restoring [1, 3, 5], non-restoring [2, 6, 9]), division by convergence (Newton-Raphson (NR) method [10–12]) and division by series expansion (Goldschmidt (GS) algorithm [13, 14]). Generally, division architectures can be classified into two categories: (i) iteration based and (ii) multiplication based. Iterative division consists of shift-and-subtract operations and generates one quotient bit in each iteration, as in radix-2 restoring and non-restoring division. After each subtraction cycle, iterative division must check whether the resulting remainder is less than the divisor or negative. The computational complexity of DR algorithms [1–3, 5, 6, 9] is low, but the large number of iterations makes the latency high. Some researchers rely on higher-radix implementations of the DR algorithm [6, 7, 10] to reduce the iterations; the latency improves on earlier reports [1–3, 5, 9], but these schemes increase the hardware complexity. Other attractive ideas are based on functional iterations, such as the NR [10–12] and GS [13–15] algorithms, which use multiplication together with series expansion, so that the number of quotient bits obtained doubles in each iteration. These methods converge quadratically towards the quotient, but the latency still becomes high as the number of iterations increases. Each iteration of the NR and GS methods involves two dependent multiplications: the product of the first multiplication is one of the operands of the second, so the iteration cannot be optimised like a parallel multiplier [13]. The drawbacks of these methods are that the operands must be normalised beforehand, the most used primitives are multiplications, and the remainder is not obtained directly.

At the algorithmic and structural levels, a substantial number of division techniques have been developed to reduce the propagation delay and power consumption of divider circuitry by reducing the iterations, aiming at high-speed operation, but the principle behind the division techniques is the same in all cases. Vedic mathematics [16] is the ancient system of mathematics, which has unique computation techniques based on 16 sutras (formulae). Recently, we [17] reported a Vedic divider based on 'Nikhilam Navatascaramam Dasatah' for some specific number systems, in which the divisor was chosen very close to the base of operation. That implementation reduces the number of iterations if the divisor is close to the base of operation; otherwise it increases the iterations, which is a serious bottleneck of the algorithm. In this Letter, we report a division technique based on such ancient mathematics and its transistor-level implementation. 'Dhvajanka', a Sanskrit term meaning 'on top of the flag', is a formula adopted from the Vedas and is used here to implement the division circuitry. In this approach, the division is transformed into a small division (by part of the divisor instead of the actual divisor), subtraction and a few multiplications, which reduces the iterations and hence substantially reduces the propagation delay. The transistor-level (application specific integrated circuit (ASIC)) implementation of the division circuitry was carried out by combining Boolean arithmetic with Vedic mathematics. Performance parameters such as propagation delay and dynamic switching power consumption of the proposed method were calculated using spice spectre in 90 nm complementary metal oxide semiconductor (CMOS) technology and compared with other designs such as DR- [9], NR- [11] and GS- [15] based implementations. The calculated results reveal that the (32 ÷ 16) bit divider circuitry has a propagation delay of about 300 ns with 32.53 mW dynamic switching power for a layout area of 17.39 mm².
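As a point of reference for the two families of dividers discussed in this section, the following Python sketch models radix-2 restoring division, which produces one quotient bit per shift-and-subtract iteration, and an NR reciprocal iteration, whose relative error is squared at every step. It is an illustrative software model only and is not taken from any of the cited hardware designs; the function names and the starting estimate 48/17 − (32/17)d (a common textbook choice for a divisor normalised to the range [0.5, 1)) are assumptions introduced here.

```python
from fractions import Fraction


def restoring_divide(dividend: int, divisor: int, n_bits: int) -> tuple[int, int]:
    """Radix-2 restoring division: one quotient bit per iteration."""
    remainder = 0
    quotient = 0
    for i in range(n_bits - 1, -1, -1):
        # Shift the partial remainder left and bring down the next dividend bit.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        trial = remainder - divisor
        if trial >= 0:
            # Subtraction succeeded: keep it and set the quotient bit.
            remainder = trial
            quotient |= 1 << i
        # Otherwise the old remainder is kept ("restored") and the bit stays 0.
    return quotient, remainder


def nr_reciprocal(d: Fraction, iterations: int) -> Fraction:
    """Newton-Raphson estimate of 1/d for d normalised to [0.5, 1).

    Each step needs two dependent multiplications: d * x, then x * (2 - d * x).
    """
    x = Fraction(48, 17) - Fraction(32, 17) * d  # assumed starting estimate
    for _ in range(iterations):
        x = x * (2 - d * x)
    return x


if __name__ == "__main__":
    print(restoring_divide(0b10000100, 0b1011, 8))
    d = Fraction(3, 4)
    for k in range(4):
        print(k, float(1 - d * nr_reciprocal(d, k)))
```

The restoring loop always runs for as many iterations as there are dividend bits, whereas the relative error 1 − d·x of the NR estimate is exactly squared by each step, which is the quadratic convergence mentioned above.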
J Eng 2014, doi: 10.1049/joe.2013.0213
This is an open access article published by the IET under the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/)
Fig. 1 Illustration of Dhvajanka sutra
a Small divisor with exact division (remainder = 0)
b Large divisor (remainder ≠ 0)
Both cases are considered for illustration purposes
2 Vedic division methodology

The gifts of ancient Indian mathematics to the world history of mathematical science are not well recognised. The contributions of the mathematician Sri Bharati Krsna Thirthaji Maharaja in the field of number theory, in the form of Vedic sutras (formulae) [16], are significant for calculations. He explored the mathematical potential of the Vedic primers and showed that mathematical operations can be carried out mentally to produce fast answers using the sutras (formulae). In this Letter, we use only the Dhvajanka formula to implement the division algorithm and its architecture.

2.1 Numerical example of Dhvajanka sutra

In the example shown in Fig. 1a, the dividend is 38 982 (a five-digit number) and the divisor is 73 (a two-digit number). Out of the divisor 73, we put down only the first digit (i.e. 7) in the divisor column and put the other digit (i.e. 3) 'on top of the flag'. In the example shown in Fig. 1b, the dividend is 135 791 and the divisor is 1632. The entire division is carried out by 7 for Fig. 1a and by 16 for Fig. 1b. The implementation procedure for both charts is described in Table 1.
Table 1 Chart implementation procedure for the examples of Fig. 1

Implementation steps of Fig. 1a
1. One digit of the divisor has been put on top; we allot one place (at the right end of the dividend) to the remainder portion of the answer and mark it off from the other digits by a vertical line.
2. 38 is divided by the most significant digit (MSD) of the divisor (i.e. 7). The quotient is 5 and the remainder is 3. This remainder is used for the next division step.
3. In this step, the actual gross dividend is 39. The value obtained by multiplying the previous quotient (i.e. 5) by the least significant digit (LSD) of the divisor (i.e. 3) is subtracted from it: 39 − (5 × 3) = 24. After subtraction, the result is again divided by 7. The quotient becomes 3 and the remainder becomes 3.
4. In the third stage, the gross dividend is 38. Again, 9 (3 × 3) is subtracted from it, as in the previous step. The result (38 − 9 = 29) is divided by 7. The quotient is 4 and the remainder is 1.
5. This is the final stage for this example, in which the final remainder is calculated. 12 (4 × 3) is subtracted from the actual gross dividend (12), and the result (0) is the final remainder.
6. Thus, the quotient is 534 and the remainder is 0.

Implementation steps of Fig. 1b
1. Two digits of the divisor have been put on top; we allot two places (at the right end of the dividend) to the floating point portion of the answer and mark it off from the other digits by a vertical line.
2. 135 is divided by the MSD part of the divisor (i.e. 16). The quotient is 8 and the remainder is 7. This remainder is used for the next division step.
3. In this step, the actual gross dividend is 77. The value obtained by multiplying the previous quotient (i.e. 8) by the LSD part of the divisor (i.e. 3) is subtracted from it: 77 − (8 × 3) = 53. After subtraction, the result is again divided by 16. The quotient becomes 3 and the remainder becomes 5.
4. In the third stage, the gross dividend is 59. The cross-multiplication of the LSD part of the divisor and the obtained quotient digits is subtracted from it, that is, 59 − (3 × 3 + 8 × 2) = 34. The result is divided by 16. The quotient is 2 and the remainder is 2.
5. In the fourth stage, the gross dividend is 21. The cross-multiplication of the quotient digits and the LSD part of the divisor is again subtracted from it, and the result becomes 9. This result is again divided by 16; the quotient becomes 0 and the remainder becomes 9.
6. The process continues until the number of iterations
7. Thus, the result becomes 83.205
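The steps of Table 1 can be sketched in software as follows. This is a minimal Python model written for illustration, not the circuit reported in this Letter: the function name `flag_divide` and the parameter `head_len` (the number of leading divisor digits kept in the divisor column) are assumptions introduced here. It also makes one simplification: instead of reducing a quotient digit whenever a subtraction goes negative, the sketch allows intermediate quotient digits to go out of range and applies a single correction with the full divisor at the end, so it returns an integer quotient and remainder rather than a decimal expansion.

```python
def flag_divide(dividend: int, divisor: int, head_len: int = 1) -> tuple[int, int]:
    """Divide non-negative `dividend` by positive `divisor`, flag-digit style.

    Only the leading `head_len` digits of the divisor are used for the small
    divisions; the remaining ("flag") digits enter through cross-multiplication.
    """
    a = [int(ch) for ch in str(dividend)]
    d = [int(ch) for ch in str(divisor)]
    head = int("".join(str(x) for x in d[:head_len]))  # e.g. 7 for 73, 16 for 1632
    flag = d[head_len:]                                # e.g. [3] or [3, 2]
    k = len(flag)
    last_q = len(a) - 1 - k  # the last k dividend digits form the remainder part

    q = []      # quotient digits, most significant first (may be out of 0..9)
    carry = 0   # running remainder carried into the next gross dividend
    for j, digit in enumerate(a):
        gross = carry * 10 + digit
        # Cross-multiply the flag digits with the most recent quotient digits.
        cross = sum(flag[i - 1] * q[j - i]
                    for i in range(1, k + 1) if 0 <= j - i <= last_q)
        net = gross - cross
        if j <= last_q:
            qj, carry = divmod(net, head)  # small division by the head only
            q.append(qj)
        else:
            carry = net                    # remainder part: subtract only

    quotient = 0
    for qj in q:
        quotient = quotient * 10 + qj
    # Deferred correction (sketch-only simplification, see the text above).
    extra, remainder = divmod(carry, divisor)
    return quotient + extra, remainder


if __name__ == "__main__":
    print(flag_divide(38982, 73))                  # Fig. 1a example
    print(flag_divide(135791, 1632, head_len=2))   # Fig. 1b example
```

For the Fig. 1a operands the sketch reproduces the quotient 534 with remainder 0, and for the Fig. 1b operands it gives quotient 83 with remainder 335, which corresponds to the fractional part 335/1632 ≈ 0.205 of the result 83.205 in Table 1.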
Fig. 2 Algebraic proof of the formula

2.2 Algebraic proof of Dhvajanka sutra

The algebraic proof of the formula is shown in Fig. 2, where x stands for 10. It follows the steps of Fig. 1a, in which 38 982 is divided by 73. Algebraically, the dividend is represented as $38x^3 + 9x^2 + 8x + 2$ and the divisor as $7x + 3$. Now, let us proceed with the division in the usual manner.

1. If we try to divide $38x^3$ by $7x$, our first quotient digit is $5x^2$. Multiplying the divisor by $5x^2$ gives the product $35x^3 + 15x^2$, which leaves the remainder $3x^3 + 9x^2 - 15x^2$. This is actually $30x^2 + 9x^2 - 15x^2 = 24x^2$.
2. The first-step remainder (i.e. $24x^2$) plus $8x$ is our second-step dividend. We multiply the divisor by the second quotient digit $3x$, subtract the product $21x^2 + 9x$ and obtain $3x^2 - x$ as the remainder.
3. However, this $3x^2$ equals $30x$, which (with $-x + 2$) gives us $29x + 2$ as the last-step dividend. Multiplying the divisor by 4, we obtain the product $28x + 12$; subtracting this gives $x - 10$ as the remainder. However, x being 10, the remainder vanishes.

2.3 Mathematical modelling of Dhvajanka sutra

Let us assume that the number $A = \sum_{i=0}^{n-1} a_i x^i$ is the dividend and $B = \sum_{i=0}^{m-1} b_i x^i$ is the divisor, where x is the radix of the number system. So A can be expressed in terms of B as

\[ A = a_{n-1}x^{n-1} + a_{n-2}x^{n-2} + a_{n-3}x^{n-3} + a_{n-4}x^{n-4} + \cdots + a_3x^3 + a_2x^2 + a_1x + a_0 \tag{1} \]

(see (2) and (3))

2.4 Illustration of Dhvajanka sutra

Consider the dividend $f(x) = a_3x^3 + a_2x^2 + a_1x + a_0$ and the divisor $g(x) = b_1x + b_0$, where x is the radix. We have to compute f(x)/g(x) with the help of the 'on top of the flag' sutra. Mathematically, $f(x)/g(x) = (a_3x^3 + a_2x^2 + a_1x + a_0)/(b_1x + b_0)$ can be represented as shown in (4), (5) and (6). Then $f(x) = Q(x)g(x) + R$, where Q(x) is the quotient and R is the

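\[
\begin{aligned}
A ={}& a_{n-1}x^{n-1} + \frac{a_{n-1}}{b_{m-1}}b_{m-2}x^{n-2} + \cdots + \frac{a_{n-1}}{b_{m-1}}b_2x^{(n/2)+2} + \frac{a_{n-1}}{b_{m-1}}b_1x^{(n/2)+1} + \frac{a_{n-1}}{b_{m-1}}b_0x^{n/2} \\
&+ \left(a_{n-2} - \frac{a_{n-1}}{b_{m-1}}b_{m-2}\right)x^{n-2} + \cdots + \left(a_{(n/2)+2} - \frac{a_{n-1}}{b_{m-1}}b_2\right)x^{(n/2)+2} \\
&+ \left(a_{(n/2)+1} - \frac{a_{n-1}}{b_{m-1}}b_1\right)x^{(n/2)+1} + \left(a_{n/2} - \frac{a_{n-1}}{b_{m-1}}b_0\right)x^{n/2} + \cdots + \left(a_0 - \frac{a_1 - \frac{a_2 - (\ldots)}{b_{m-1}}}{b_{m-1}}b_0\right)
\end{aligned} \tag{2}
\]

\[
A = \left(b_{m-1}x^{(n/2)-1} + b_{m-2}x^{(n/2)-2} + \cdots + b_0\right)\left(\frac{a_{n-1}}{b_{m-1}}x^{n/2} + \frac{a_{n-2} - \frac{a_{n-1}}{b_{m-1}}b_{m-2}}{b_{m-1}}x^{(n/2)-1} + \cdots\right) + \left(a_0 - \frac{a_1 - \frac{a_2 - (\ldots)}{b_{m-1}}}{b_{m-1}}b_0\right) \tag{3}
\]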

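\[
\frac{f(x)}{g(x)} = \frac{\dfrac{a_3}{b_1}x^2(b_1x + b_0) + \dfrac{a_2 - \frac{a_3}{b_1}b_0}{b_1}x(b_1x + b_0) + \dfrac{a_1 - \frac{a_2 - \frac{a_3}{b_1}b_0}{b_1}b_0}{b_1}(b_1x + b_0) + \left(a_0 - \dfrac{a_1 - \frac{a_2 - \frac{a_3}{b_1}b_0}{b_1}b_0}{b_1}b_0\right)}{b_1x + b_0} \tag{4}
\]

\[
\frac{f(x)}{g(x)} = \frac{\left(\dfrac{a_3}{b_1}x^2 + \dfrac{a_2 - \frac{a_3}{b_1}b_0}{b_1}x + \dfrac{a_1 - \frac{a_2 - \frac{a_3}{b_1}b_0}{b_1}b_0}{b_1}\right)(b_1x + b_0)}{b_1x + b_0} + \frac{a_0 - \dfrac{a_1 - \frac{a_2 - \frac{a_3}{b_1}b_0}{b_1}b_0}{b_1}b_0}{b_1x + b_0} \tag{5}
\]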
remainder.

\[
\frac{f(x)}{g(x)} = \frac{a_3}{b_1}x^2 + \frac{a_2 - \frac{a_3}{b_1}b_0}{b_1}x + \frac{a_1 - \frac{a_2 - \frac{a_3}{b_1}b_0}{b_1}b_0}{b_1} + \frac{a_0 - \dfrac{a_1 - \frac{a_2 - \frac{a_3}{b_1}b_0}{b_1}b_0}{b_1}b_0}{b_1x + b_0} = Q(x) + \frac{R}{g(x)} \tag{6}
\]

Through the algebraic identity, the equations can be rewritten as

\[
Q(x) = \frac{a_3}{b_1}x^2 + \frac{a_2 - \frac{a_3}{b_1}b_0}{b_1}x + \frac{a_1 - \frac{a_2 - \frac{a_3}{b_1}b_0}{b_1}b_0}{b_1}
\]

and

\[
R = a_0 - \frac{a_1 - \frac{a_2 - \frac{a_3}{b_1}b_0}{b_1}b_0}{b_1}b_0
\]

2.5 Flowchart diagram of the algorithm

In this section, the divider implementation algorithm, which leads towards high-speed operation, is discussed. The flowchart of the algorithm is shown in Fig. 3, where the dividend (A) and the divisor (B) are considered to be n-bit and m-bit, respectively. The implementation procedure using the flowchart is described in Table 2, where two examples are considered: Example 1 is a perfect division (remainder = 0) and Example 2 is an imperfect division (remainder ≠ 0). For simplicity, an (8 ÷ 4) bit divider example has been considered; higher-order examples can be implemented in a similar manner.

3 Divider implementation technique

The proposed divider implementation technique is shown in Fig. 4. The architecture has been implemented via (3). For simplicity, let us assume that the dividend is longer than the divisor. The divisor is broken into two parts: the most significant part (L) and the least significant part (R). L is compared with an equal number of bits of the dividend taken from the most significant bit (MSB) side. If these dividend bits are greater than L, they are divided directly by L; otherwise, they are concatenated with the next significant bit of the dividend. The division is implemented through a subtractor: the difference acts as the remainder, and the borrow works as the selector input of the multiplexer. If the borrow is equal to 0, the quotient bit is 1; otherwise it is 0. The remainder is then concatenated with the next most significant bit of the dividend, and the cross-multiplication result of the quotient bits and the least significant bits of the divisor is subtracted from it. If the result is negative, the quotient is reduced by 1 and the new quotient bits are set; if the result is positive, it is passed to the next stage. The division algorithm proceeds in the same way for the remaining bits.

Consider the number $A = \sum_{i=0}^{n-1} a_i 2^i$ to be divided by $B = \sum_{i=0}^{m-1} b_i 2^i$, where $a_i, b_i \in \{0, 1\}$. To execute the division operation easily through the Dhvajanka ('on top of the flag') methodology, it has been assumed that the length of the dividend is greater than the length of the divisor.

3.1 Implementation procedure

Step 1: Consider the most significant part of the dividend, $\sum_{i=n-(m/2)}^{n-1} a_i 2^i$, and of the divisor, $\sum_{i=m/2}^{m-1} b_i 2^i$.
Step 2: Determine $\sum_{i=n-(m/2)}^{n-1} a_i 2^i - \sum_{i=m/2}^{m-1} b_i 2^i$. If the first borrow is 0, the multiplexer sets the quotient bit $Q_n$ to 1, and the remainder is R.
Step 3: Determine $Q_n \times b_{(m/2)-1}$. Concatenate R and $a_{n-(m/2)-1}$ and subtract $Q_n \times b_{(m/2)-1}$. Divide again by the same procedure (step 1). Set the quotient bit $Q_{n-1}$ and the remainder R.
Step 4: Determine $Q_{n-1} \times b_{(m/2)-2} + Q_n \times b_{(m/2)-1}$. Concatenate R and $a_{n-(m/2)-2}$ and subtract $Q_{n-1} \times b_{(m/2)-2} + Q_n \times b_{(m/2)-1}$. Divide again by the same procedure (step 1). Set the quotient bit $Q_{n-2}$ and the remainder R.

3.2 Latency of the divider

The hardware cost of the architecture can be computed from the number of complex operations performed in its critical path, from which the total propagation delay can be estimated. The reported architecture for division using Vedic mathematics is computed in the five stages shown in Fig. 5, with a maximum of n iterations (for imperfect division). The total latency can therefore be computed as the sum of the propagation delays of the individual subsections, over n iterations. The total propagation delay of the proposed architecture ($t_{pd}$) can be computed as

\[ t_{pd} = t_{stage1} + t_{stage2} + t_{stage3} + t_{stage4} + t_{stage5} \tag{7} \]

where $t_{stage1}$, $t_{stage2}$, $t_{stage3}$, $t_{stage4}$ and $t_{stage5}$ are the propagation delays of stages 1 to 5, respectively.

Stage 1 contains only a comparator [18], which is implemented through two stages of parallel adders and two stages of XOR gates. For an m bit divisor, at most an m/2 bit comparator is required, and therefore at most an m/2 bit parallel adder in each case. The critical path of a full adder is equal to 2 XOR gate delays, so the critical path of an m/2 bit parallel adder is equal to (m/2) × 2 = m XOR gate delays. Since two stages of parallel adders and two XOR stages are required to implement the comparator, its total propagation delay equals (2m + 2) XOR gate delays. The second stage contains only an m/2 bit parallel subtractor; the critical path of a 1 bit subtractor equals 3 XOR gate delays, so the total critical path delay of the m/2 bit subtractor may be estimated as (m/2) × 3 XOR gate delays. The third stage contains only an n bit parallel adder; assuming that one full adder requires 2 XOR gate delays, the total propagation delay of the n bit parallel adder is n × 2 XOR gate delays. The fourth stage contains an m/2 bit multiplier and an n bit subtractor in the feedback path. The critical path delay of the n bit subtractor is assumed to equal 3 × n XOR gate delays. To implement the multiplier, three stages are required, namely (i) partial product generation, (ii) partial product addition and (iii) final addition [18]. In the partial product generation stage, the maximum depth of a column of the partial product is equal to m/2, so partial product generation requires m/2 XOR gate delays (assuming that XOR gate and AND gate delays are equal). For addition, it may require (m/(2 × 3)) × 2 XOR gate delays, that is, m/3 XOR gate delays, for
partial product addition in the first stage. The second stage requires m/6 XOR gate delays, and so on; the total for addition may thus be approximated as m + (m/2) = (3m/2) XOR gate delays. The maximum delay of the multiplication is likewise approximated as 3m/2 XOR gate delays. In the fifth stage, an m/2 bit subtractor is required, whose critical path delay equals (m/2) × 3 XOR gate delays.

Thus, the total propagation delay for each iteration may be approximated as

\[
t_{pd} = t_{stage1} + t_{stage2} + t_{stage3} + t_{stage4} + t_{stage5} = (2m + 2) + \frac{3m}{2} + 2n + 3n + \frac{3m}{2} + \frac{3m}{2} = 5n + \frac{13m}{2} + 2
\]

XOR gate delays. Thereby, n iterations may consume n(5n + (13m/2) + 2) XOR gate delays.

Fig. 3 Flowchart representation of divider using Dhvajanka formula

4 Results and discussion

The advantages of CMOS transmission gate (TG) logic over conventional CMOS and complementary pass transistor logic (CPL) [19, 20] are well established. As the CMOS TG consists of one p-channel MOSFET (PMOS) and one n-channel MOSFET (NMOS) connected in parallel, its ON resistance is smaller than that of even a single NMOS. Proper modifications at the device, circuit and architectural levels of the design hierarchy have been implemented to reduce the energy delay product (EDP) and power delay product (PDP) of the proposed design. TGs are used in the design of the different modules for faster operation and better logic transformation. A dual threshold voltage (VT) operating mode was considered in simulation to determine the performance parameters. The choice of threshold voltage for a particular transistor in the circuit is based on the following rules:
(i) Placement of high-VT transistors on the leakage path directly between supply and ground reduces the subthreshold leakage current and hence the static power.
(ii) Placement of low-VT transistors on the signal propagation path from the input node to the output improves the performance substantially.
(iii) A logical intersection of the conditions illustrated in (i) and (ii) requires an optimised choice that leads to the minimum EDP.

Table 2 Illustration of the flowchart with the help of two examples. Example 1 is a complete division (remainder = 0); Example 2 is an incomplete division (remainder ≠ 0)
Steps | Example 1 | Example 2
initialisation | A = 10000100; B = 1011; L = 10, l = 2; R = 11, r = 2; Q := 0; i = 7 | A = 10101010; B = 1111; L = 11, l = 2; R = 11, r = 2; Q := 0; i = 7
step 1 | T = 10; i = (i − l) = (7 − 2) = 5; T = T − L = 10 − 10 = 00; Q = 1; d = 00; i = 4; d = 00 − (1 × 1 + 1 × 0) = 00 − 01 = −Ve; T = T + L = 00 + 10 = 10; Q = Q − 1 = 0; i = 5 | T = 10; i = (i − l) = (7 − 2) = 5; Q = Q × 2 + 0 = 0; d = 101; i = 4; d = 101 − 0 = 101; T = d; T = 101; T = T − L = 101 − 11 = 10; Q = 1
step 2 | d = 100; i = 4; d = 100 − (1 × 0 + 0 × 1) = 100 − 00 = 100; T = d = 100 | i = 4 ≥ 0; d = 100; i = 3; d = 100 − (1 × 1 + 1 × 0) = 11; T = d = 11
step 3 | T = T − L = 100 − 10 = 10; Q = 1; d = 100; i = 3; d = 100 − (1 × 1 + 1 × 0) = 100 − 01 = 11; T = d = 11 | T = T − L = 11 − 11 = 0; Q = 11; i = 3 ≥ 0; d = 01; i = 2; d = d − (1 × 1 + 1 × 1) = 01 − 10 = −Ve; T = T + L = 0 + 11 = 11; Q = Q − 1 = 11 − 1 = 10; i = i + 1 = 3
step 4 | T = T − L = 11 − 10 = 01; Q = 11; d = 10; i = 2; d = 10 − (1 × 1 + 1 × 1) = 10 − 10 = 00; T = d = 00 | d = 111; i = i − 1 = 3 − 1 = 2; d = 111 − (1 × 1 + 1 × 0) = 111 − 01 = 110; T = d = 110
step 5 | Q = 110; d = 01; i = 1; d = 01 − 01 = 00; T = d = 00 | T = T − L = 110 − 11 = 11; Q = 101; d = 110, i = 1; d = 110 − (1 × 1 + 1 × 0) = 110 − 1 = 101; T = d = 101
step 6 | Q = 1100; d = 00; i = 0; d = 00 − 00 = 00; T = d = 00 | T = 101 − 11 = 10; Q = 1011; d = 101; i = 0; d = 101 − (1 × 1 + 1 × 1) = 101 − 10 = 11; T = d = 11
step 7 | Q = 11000; d = 00; i = −1; d = 00 − 00 = 00; T = d = 00 | T = T − L = 11 − 11 = 0; Q = 10111; d = 00; i = −1; d = 00 − (1 × 1 + 1 × 1) = 00 − 10 = −Ve; T = T + L = 00 + 11 = 11; Q = Q − 1 = 10110, i = 0
step 8 | Q = 110000 | d = 110, i = −1; d = 110 − (1 × 0 + 1 × 1) = 110 − 1 = 101; T = d = 101
step 9 | | T = T − L = 101 − 11 = 10; Q = 101101; d = 100; d = 100 − (1 × 1 + 1 × 0) = 11; T = d = 11
result | when i = r − 1, the floating point (bit) starts; Q = 1100.00 | when i = r − 1, the floating point (bit) starts; Q = 1011.01

The entire algorithm in this Letter was simulated, and its functionality was examined with the spice spectre simulator. Performance parameters such as propagation delay and dynamic power consumption were calculated using standard 90 nm CMOS technology with a 1 V power supply, operated at 250 MHz. As shown, the Vedic division methodology reduces the number of iterations, which results in a reduction of propagation delay and dynamic switching power consumption.

To implement Vedic dividers of (4 ÷ 4), (4 ÷ 8), (4 ÷ 16), (8 ÷ 4), (8 ÷ 8), (8 ÷ 16) etc. bits, all the individual modules such as the subtractor, adder and cross-multiplier were implemented through TGs to make the circuit faster. The individual performance parameters such as propagation delay, dynamic switching power consumption, EDP and PDP of the different circuit modules have been computed. With the help of all the modules, the final simulation was carried out and the performance parameters were calculated. A comparative study between different architectures and the proposed architecture for (4 ÷ 4), (4 ÷ 8), (4 ÷ 16), (8 ÷ 4), (8 ÷ 8), (8 ÷ 16) etc. bit dividers is shown in Table 3. The modifications at the device, circuit and architectural levels of the design hierarchy have been analysed in terms of propagation delay, average power dissipation and their products. The values of delay, power, EDP and PDP of the different architectures are measured and tabulated in Table 3. The EDP (in 10⁻²¹ J s) and PDP (in 10⁻¹² J) are quantitative measures of efficiency and of the compromise between speed and power dissipation. EDPs and PDPs are particularly important when high-speed operation is needed; they are compared here at a 1 V supply voltage with 90 nm CMOS technology. Input data were taken in a regular fashion for experimental purposes. For each transition, the delay is measured from 50% of the input voltage swing to 50% of the output voltage swing.

It is worth mentioning that we have taken the implementation methodologies from the respective references [9, 11, 15], implemented them in the same technological environment (spice spectre with standard 90 nm CMOS technology) and then compared the performance parameters. The propagation delay and switching power are the worst-case delay and power over all possible bit combinations. It can be observed from Table 3 that the (32 ÷ 16) bit divider requires about 300 ns to propagate a signal and consumes 32.53 mW for a layout area of 17.39 mm². The proposed architecture offered 47.3, 38.4 and 34% faster operation (propagation delay) than the DR [9], NR [11] and GS [15] architectures, respectively.
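The derived quantities quoted here follow directly from the measured delay and power, and the relationship can be checked with a few lines of Python. The sketch below is only an illustration using the (32 ÷ 16) bit values reported in Table 3; the function names are assumptions introduced here, and the unit conventions (ns × mW = 10⁻¹² J, and that product × ns = 10⁻²¹ J s) are what make the numbers line up with the PDP and EDP columns.

```python
def pdp_pj(delay_ns: float, power_mw: float) -> float:
    """Power delay product in units of 1e-12 J (ns x mW = pJ)."""
    return delay_ns * power_mw


def edp_1e21(delay_ns: float, power_mw: float) -> float:
    """Energy delay product in units of 1e-21 J s (pJ x ns)."""
    return pdp_pj(delay_ns, power_mw) * delay_ns


def improvement_percent(reference: float, proposed: float) -> float:
    """Percentage saving of the proposed value relative to a reference value."""
    return 100.0 * (reference - proposed) / reference


if __name__ == "__main__":
    # (32 / 16) bit rows of Table 3: (delay in ns, power in mW).
    rows = {"DR": (570.01, 49.5), "NR": (487.52, 41.25), "GS": (454.75, 39.7)}
    proposed_delay, proposed_power = 299.92, 32.53
    print("PDP:", pdp_pj(proposed_delay, proposed_power))
    print("EDP:", edp_1e21(proposed_delay, proposed_power))
    for name, (delay, power) in rows.items():
        print(name,
              round(improvement_percent(delay, proposed_delay), 2),
              round(improvement_percent(power, proposed_power), 2))
```

With the proposed design's 299.92 ns and 32.53 mW, the two products reproduce the tabulated PDP and EDP entries to within rounding, which also confirms that the EDP column is expressed in 10⁻²¹ J s.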
Fig. 4 Hardware implementation of divider using Dhvajanka formula

Fig. 5 Latency analysis of divider using Dhvajanka formula

Table 3 Performance parameters: propagation delay (ns), dynamic switching power consumption (mW), EDP (10⁻²¹ J s), PDP (10⁻¹² J) and percentage savings in propagation delay and dynamic switching power consumption of the proposed methodology relative to each architecture, as a function of the input number of bits. The architectures were implemented through the spice spectre (T-Spice V13) simulator with 90 nm CMOS technology. For each transition, the delay is measured from 50% of the input voltage swing to 50% of the output voltage swing
Input no. of bits | Architecture | Delay, ns | Power, mW | EDP (10⁻²¹ J s) | PDP (10⁻¹² J) | Improvement in delay, % | Improvement in power, %
4 ÷ 4 | DR [9] | 12.8 | 0.99 | 162.2016 | 12.672 | 47.65 | 37.3
4 ÷ 4 | NR [11] | 11.18 | 0.88 | 109.9933 | 9.8384 | 40 | 29.5
4 ÷ 4 | GS [15] | 10.44 | 0.78 | 85.01501 | 8.1432 | 35.8 | 20.5
4 ÷ 4 | proposed | 6.7 | 0.62 | 27.83 | 4.15 | - | -
4 ÷ 8 | DR [9] | 19.79 | 1.59 | 622.7141 | 31.4661 | 48.86 | 37.7
4 ÷ 8 | NR [11] | 17.3 | 1.34 | 401.0486 | 23.182 | 41.5 | 26.1
4 ÷ 8 | GS [15] | 16.16 | 1.25 | 326.432 | 20.2 | 37.3 | 20.8
4 ÷ 8 | proposed | 10.12 | 0.99 | 101.39 | 10.01 | - | -
4 ÷ 16 | DR [9] | 33.7 | 2.97 | 3372.999 | 100.089 | 47 | 42
4 ÷ 16 | NR [11] | 29.41 | 2.47 | 2136.422 | 72.6427 | 39.3 | 32.03
4 ÷ 16 | GS [15] | 27.55 | 2.44 | 1851.966 | 67.222 | 35.2 | 29.5
4 ÷ 16 | proposed | 17.84 | 1.72 | 547.41 | 30.68 | - | -
8 ÷ 4 | DR [9] | 36.34 | 3.17 | 4186.288 | 115.1978 | 47 | 39.1
8 ÷ 4 | NR [11] | 31.78 | 2.66 | 2686.516 | 84.5348 | 40 | 27.4
8 ÷ 4 | GS [15] | 29.7 | 2.5 | 2205.225 | 74.25 | 36 | 22.8
8 ÷ 4 | proposed | 19.0 | 1.93 | 696.73 | 36.67 | - | -
8 ÷ 8 | DR [9] | 50.3 | 4.42 | 11 183 | 222.326 | 48.11 | 37.7
8 ÷ 8 | NR [11] | 43.4 | 3.68 | 6931.501 | 159.712 | 39.86 | 25.2
8 ÷ 8 | GS [15] | 41.0 | 3.45 | 5799.45 | 141.45 | 36.3 | 20.28
8 ÷ 8 | proposed | 26.1 | 2.25 | 1873.32 | 71.77 | - | -
8 ÷ 16 | DR [9] | 78.07 | 6.93 | 42 237.83 | 541.0251 | 34.8 | 35
8 ÷ 16 | NR [11] | 68.25 | 5.775 | 26 900.31 | 394.1438 | 25.4 | 22
8 ÷ 16 | GS [15] | 63.8 | 5.46 | 22 224.6 | 348.348 | 20.12 | 17
8 ÷ 16 | proposed | 50.9 | 4.5 | 11 658.65 | 229.05 | - | -
16 ÷ 4 | DR [9] | 115.5 | 10.01 | 133 535.9 | 1156.155 | 47.7 | 38
16 ÷ 4 | NR [11] | 101.0 | 8.47 | 86 402.47 | 855.47 | 40.21 | 26.8
16 ÷ 4 | GS [15] | 94.3 | 8.02 | 71 317.77 | 756.286 | 35.9 | 22.6
16 ÷ 4 | proposed | 60.38 | 6.2 | 22 603.6 | 374.35 | - | -
16 ÷ 8 | DR [9] | 143.3 | 12.77 | 262 230.5 | 1829.941 | 46.9 | 37.9
16 ÷ 8 | NR [11] | 125.31 | 10.66 | 167 389.7 | 1335.805 | 39.3 | 25.7
16 ÷ 8 | GS [15] | 117.0 | 9.59 | 131 277.5 | 1122.03 | 35.04 | 17.4
16 ÷ 8 | proposed | 76.0 | 7.92 | 45 745.92 | 601.92 | - | -
16 ÷ 16 | DR [9] | 198.91 | 17.67 | 699 116.9 | 3514.74 | 42.1 | 34.7
16 ÷ 16 | NR [11] | 173.5 | 14.05 | 422 936.6 | 2437.675 | 33.7 | 17.95
16 ÷ 16 | GS [15] | 162.3 | 13.95 | 367 461 | 2264.085 | 29.1 | 17.3
16 ÷ 16 | proposed | 115 | 11.53 | 152 484.3 | 1325.95 | - | -
32 ÷ 4 | DR [9] | 402.0 | 35.79 | 5 783 807 | 14 387.58 | 45.29 | 36
32 ÷ 4 | NR [11] | 351.6 | 29.86 | 3 691 370 | 10 498.78 | 37.45 | 23.3
32 ÷ 4 | GS [15] | 328.4 | 28.08 | 3 028 331 | 9221.472 | 33.0 | 18.4
32 ÷ 4 | proposed | 219.9 | 22.9 | 1 107 353 | 5035.71 | - | -
32 ÷ 8 | DR [9] | 457.8 | 40.78 | 8 546 707 | 18 669.08 | 45.19 | 37.4
32 ÷ 8 | NR [11] | 400.28 | 34.82 | 5 579 002 | 13 937.75 | 37.3 | 26.7
32 ÷ 8 | GS [15] | 375.9 | 31.98 | 4 518 800 | 12 021.28 | 33.2 | 20.2
32 ÷ 8 | proposed | 250.9 | 25.5 | 1 605 246 | 6397.95 | - | -
32 ÷ 16 | DR [9] | 570.01 | 49.5 | 16 083 114 | 28 215.5 | 47.3 | 34.2
32 ÷ 16 | NR [11] | 487.52 | 41.25 | 9 804 125 | 20 110.2 | 38.4 | 21.18
32 ÷ 16 | GS [15] | 454.75 | 39.7 | 8 209 863 | 18 053.58 | 34 | 18.06
32 ÷ 16 | proposed | 299.92 | 32.53 | 2 926 139 | 9756.398 | - | -

On the other hand, the corresponding reductions in power consumption
were 34.2, 21.18 and 18.06%, respectively, compared with the same architectures. The layout of the proposed (32 ÷ 16) bit divider, shown in Fig. 6, was implemented using L-Edit (T-Spice V-13), and the corresponding area was found to be 17.39 mm².

Fig. 6 Layout of the proposed (32 ÷ 16) bit Vedic divider. The layout was implemented through the L-Edit (T-Spice V-13) simulator; the corresponding area was found to be 17.39 mm²

5 Conclusions

A new division approach based on Vedic mathematics has been proposed for ultra-high-speed and low-power very large scale integration applications. The proposed approach is applied to (32 ÷ 16) division, and it is found to require less processor memory space than the conventional method. In addition, the proposed algorithm takes the minimum number of stages for the division, which significantly reduces the operational time. In the division circuitry, a (32 ÷ 16) bit divider implementation was transformed into a small division (by part of the divisor instead of the actual divisor), subtraction and a few multiplications, which reduces the iterations and hence substantially reduces the propagation delay. The propagation delay for (32 ÷ 16) bit division was only about 300 ns, whereas the power consumption was 32.53 mW for a layout area of 17.39 mm². The improvements in speed were found to be 47.3, 38.4 and 34% for the (32 ÷ 16) bit division circuitry, whereas the corresponding reductions in power consumption were 34.2, 21.18 and 18.06%, compared with DR-, NR- and GS-based implementations, respectively.

6 References

[1] Juang T.-B., Chen S.-H.H., Li S.M.: A novel VLSI iterative divider architecture for fast quotient generation. Proc. IEEE Int. Symp. Circuits and Systems 2011, Seattle, WA, USA, May 2008, pp. 3358–3361
[2] Oberman S.F., Flynn M.J.: Division algorithms and implementations, IEEE Trans. Comput., 1997, 46, (8), pp. 833–854
[3] Deschamps J.-P., Bioul G.J.A., Sutter G.D.: Synthesis of arithmetic circuits, FPGA, ASIC and embedded system (John Wiley & Sons, Inc., 2006)
[4] Hagglund R., Lowenborg P., Vesterbacka M.: A polynomial-based division algorithm, Proc. IEEE Int. Symp. Circuits Syst., 2002, 3, pp. 571–574
[5] Aggarwal N., Asooja K., Verma S.S., Negi S.: An improvement in the restoring division algorithm (needy restoring division algorithm). Proc. IEEE Int. Conf. Computer Science and Information Technology, Beijing, August 2009, pp. 246–249
[6] Sutter G., Deschamps J.P.: High speed fixed point divider for FPGAs. Proc. IEEE Int. Conf. Field Programmable Logic and Applications, Prague, August 2009, pp. 448–452
[7] Sutter G., Deschamps J.P.: Fast radix 2^k divider for FPGAs. Proc. IEEE Int. Conf. Programmable Logic, Sao Carlos, April 2009, pp. 115–122
[8] Jun K., Swartzlander E.E.Jr.: Modified non-restoring division algorithm with improved delay profile and error correction. Proc. IEEE Int. Conf. Signals System and Computer, 2012, pp. 1460–1464
[9] Liu W., Nannarelli A.: Power efficient division and square root unit, IEEE Trans. Comput., 2012, 61, (8), pp. 1059–1070
[10] Louvet N., Muller J.M., Panhaleux A.: Newton-Raphson algorithms for floating-point division using an FMA. Proc. IEEE Int. Conf. Application Specific Systems Architectures and Processors, Rennes, France, July 2010, pp. 200–207
[11] Piso D., Bruguera J.D.: Simplifying the rounding for Newton-Raphson algorithm with parallel remainder. Proc. IEEE Int. Conf. Signals Systems and Computers, Pacific Grove, CA, USA, November 2009, pp. 921–925
[12] Nenadic N.M., Mladenovic S.B.: Fast division on fixed-point DSP processors using Newton-Raphson method. Proc. IEEE Int. Conf. Computers as a Tool, Belgrade, November 2005, pp. 705–708
[13] Guy E., Seidel P.-M.M., Warren E.F.Jr.: A parametric error analysis of Goldschmidt's division algorithm. Proc. IEEE Int. Conf. Computer Arithmetic, June 2003, pp. 165–171
[14] Ercegovac M.D., Imbert L., Matula D.W., Muller J.-M., Wei G.: Improving Goldschmidt division, square root, and square root reciprocal, IEEE Trans. Comput., 2000, 49, (7), pp. 759–763
[15] Kong I., Swartzlander E.E.Jr: A rounding method to reduce the required multiplier precision for Goldschmidt division, IEEE Trans. Comput., 2010, 59, (12), pp. 1703–1708
[16] Maharaja J.S.S.B.K.T.: Vedic mathematics (Motilal Banarsidass Publishers Pvt Ltd, Delhi, 2001)
[17] Saha P., Banerjee A., Bhattacharyya P., Dandapat A.: Vedic divider: novel architecture (ASIC) for high speed VLSI applications. Proc. IEEE Int. Symp. System Design, Kochi, India, December 2011, pp. 67–71
[18] Saha P., Banerjee A., Dandapat A., Bhattacharyya P.: ASIC design of a high speed low power circuit for calculation of factorial of 4-bit numbers based on ancient vedic mathematics, Microelectron. J. (Elsevier), 2011, 42, (12), pp. 1343–1352
[19] Uyemura J.P.: CMOS logic circuit design (Kluwer Academic Publishers, 2001)
[20] Chang C.H., Gu J., Zhang M.: Ultra low-voltage low-power CMOS 4-2 and 5-2 compressors for fast arithmetic circuits, IEEE Trans. Circuits Syst, I, 2004, 51, (10), pp. 1985–1997
