Published in The Journal of Engineering; Received on 19th November 2013; Accepted on 7th January 2014
Abstract: A transistor-level implementation of a division methodology based on ancient Vedic mathematics is reported in this Letter. The 'Dhvajanka' ('on top of the flag') formula from Vedic mathematics was adopted to implement a divider for practical very large scale integration applications. The division was implemented using half of the divisor bits instead of the actual divisor, together with subtraction and a small amount of multiplication. Propagation delay and dynamic power consumption of the divider circuitry were reduced significantly through the stage reduction offered by the Vedic division methodology. The functionality of the division algorithm was checked, and performance parameters such as propagation delay and dynamic power consumption were calculated, through spice spectre with 90 nm complementary metal oxide semiconductor technology. The propagation delay of the resulting (32 ÷ 16) bit divider circuitry was only 300 ns, and it consumed 32.5 mW of power for a layout area of 17.39 mm². By combining Boolean arithmetic with ancient Vedic mathematics, the number of iterations was reduced substantially, giving 47, 38 and 34% reductions in delay and 34, 21 and 18% reductions in power compared with the most widely used architectures (digit-recurrence, Newton–Raphson and Goldschmidt, respectively).
1 Introduction
Division is a fundamental operation in many scientific and engineering applications, such as arithmetic computation, signal processing, artificial intelligence and computer graphics [1–3]. Generally, division is computed in a sequential manner and is therefore costlier in terms of propagation delay (latency) than other mathematical operations such as addition, subtraction and multiplication [4].
A substantial amount of work has been carried out by various researchers to implement high-speed dividers [1–15], including the digit recurrence (DR) methodology (restoring [1, 3, 5], non-restoring [2, 6, 9]), division by convergence (Newton–Raphson (NR) method [10–12]) and division by series expansion (Goldschmidt (GS) algorithm [13, 14]). Generally, division architectures can be classified into two categories: (i) iteration based and (ii) multiplication based. Iterative division consists of shift-and-subtract operations and generates one quotient bit in each iteration, as in radix-2 restoring and non-restoring division. In iterative division, after each subtraction cycle it is necessary to check whether the resulting remainder is less than the divisor or negative. The computational complexity of each iteration of the DR algorithms [1–3, 5, 6, 9] is low, but the large number of iterations makes the latency high. Some researchers rely on higher radix implementations of the DR algorithm [6, 7, 10] to reduce the iterations, which improves the latency over earlier reports [1–3, 5, 9], but these schemes increase the hardware complexity. Other attractive ideas are based on functional iterations, such as the NR [10–12] and GS [13–15] algorithms, which use multiplication techniques along with series expansion, where the number of quotient bits obtained doubles in each iteration. These methods converge quadratically towards the quotient, but the latency still becomes high as the number of iterations increases. Each iteration of the NR and GS methods involves two dependent multiplications: the product of the first multiplication is one of the operands of the second multiplication, so it cannot be optimised like a parallel multiplier [13]. The drawbacks of these methods are that the operands must be normalised beforehand, the most used primitives are multiplications and the remainder is not directly obtained.
At the algorithmic and structural levels, a substantial number of division techniques have been developed to reduce the propagation delay and power consumption of the divider circuitry by reducing the iterations, aiming towards high-speed operation, but the principle behind the division techniques is the same in all cases. Vedic mathematics [16] is the ancient system of mathematics which has unique computation techniques based on 16 sutras (formulae). Recently, we [17] reported on a Vedic divider based on 'Nikhilam Navatascaramam Dasatah' for a specific class of numbers, in which the divisor was chosen very close to the base of operation. That implementation reduces the number of iterations if the divisor is close to the base of operation, but otherwise increases the iterations, which is a serious bottleneck of the algorithm. In this Letter, we report on a division technique based on such ancient mathematics and on its transistor-level implementation. 'Dhvajanka', a Sanskrit term meaning 'on top of the flag', is adopted from the Vedas, and this formula is used to implement the division circuitry. In this approach, the divider implementation was transformed into a small division instead of division by the actual divisor, together with subtraction and a few multiplications, thereby reducing the iterations and substantially reducing the propagation delay. Transistor-level (application specific integrated circuit (ASIC)) implementation of the division circuitry was carried out by combining Boolean arithmetic with Vedic mathematics. Performance parameters such as propagation delay and dynamic switching power consumption of the proposed method were calculated using spice spectre in 90 nm complementary metal oxide semiconductor (CMOS) technology and compared with other designs such as DR- [9], NR- [11] and GS- [15] based implementations. The calculated results revealed that the (32 ÷ 16) bit divider circuitry has a propagation delay of 300 ns with 32.53 mW dynamic switching power for a layout area of 17.39 mm².
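For reference, the radix-2 restoring scheme mentioned above as a digit-recurrence baseline can be sketched as follows. This is only an illustrative software model, not the circuitry compared in this Letter; the function name `restoring_divide` and the `width` parameter are assumptions introduced for the example. It shows why the iteration count equals the dividend width: one shift, one trial subtraction and one quotient bit per iteration.

```python
def restoring_divide(dividend: int, divisor: int, width: int) -> tuple[int, int]:
    """Radix-2 restoring division of unsigned integers.

    Hypothetical helper for illustration: `width` is the assumed bit width
    of the dividend, so the loop runs exactly `width` iterations.
    """
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    if dividend < 0 or dividend >= 1 << width:
        raise ValueError("dividend must fit in `width` bits")

    remainder = 0
    quotient = 0
    for i in range(width - 1, -1, -1):
        # Shift the partial remainder left and bring down the next dividend bit.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        # Trial subtraction: keep the result only if it is non-negative,
        # otherwise the previous partial remainder is "restored".
        trial = remainder - divisor
        if trial >= 0:
            remainder = trial
            quotient |= 1 << i
    return quotient, remainder
```

With `width=32` the loop always runs 32 times regardless of the operand values, which is the latency cost that iteration-reducing schemes try to avoid.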
J Eng 2014, doi: 10.1049/joe.2013.0213
This is an open access article published by the IET under the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/)
Fig. 1 Illustration of Dhvajanka sutra
a Small divisor with exact division (remainder = 0)
b Large divisor (remainder ≠ 0), considered for illustration purposes
Table 1 Chart implementation procedure; the examples are those of Fig. 1

Example of Fig. 1a (small divisor)
1. One digit of the divisor has been put on top; we allot one place (at the right end of the dividend) to the remainder portion of the answer and mark it off from the other digits by a vertical line
2. 38 is divided by the most significant digit (MSD) of the divisor (i.e. 7). The quotient is 5 and the remainder is 3. This remainder will be used for the next step of the division
3. In this step, the gross dividend is 39. From it we subtract the value obtained by multiplying the previous quotient (i.e. 5) by the least significant digit (LSD) of the divisor (i.e. 3): 39 − (5 × 3) = 24. After subtraction, the result is again divided by 7. The quotient becomes 3 and the remainder becomes 3
4. In the third stage, the gross dividend is 38. Again 9 (3 × 3) is subtracted from it, as in the previous step. The result (38 − 9 = 29) is divided by 7. The quotient is 4 and the remainder is 1
5. This is the final stage for this example. Here, the final remainder is calculated: 12 (4 × 3) is subtracted from the gross dividend and the result is the final remainder
6. Thus, the quotient is equal to 534 and the remainder is equal to 0

Example of Fig. 1b (large divisor)
1. Two digits have been put on top; we allot two places (at the right end of the dividend) to the floating point portion of the answer and mark it off from the other digits by a vertical line
2. 135 is divided by the MSD part of the divisor (i.e. 16). The quotient is 8 and the remainder is 7. This remainder will be used for the next step of the division
3. In this step, the gross dividend is 77. From it we subtract the value obtained by multiplying the previous quotient (i.e. 8) by the LSD part of the divisor (i.e. 3): 77 − (8 × 3) = 53. After subtraction, the result is again divided by 16. The quotient becomes 3 and the remainder becomes 5
4. In the third stage, the gross dividend is 59. From it we subtract the cross-multiplication of the LSD part of the divisor and the obtained quotient digits, that is, 59 − (3 × 3 + 8 × 2) = 34. The result is divided by 16. The quotient is 2 and the remainder is 2
5. In the fourth stage, the gross dividend is 21. Again the cross-multiplication of the quotient digits and the LSD part of the divisor (2 × 3 + 3 × 2 = 12) is subtracted from it and the result becomes 9. This result is divided again by 16; the quotient becomes 0 and the remainder becomes 9
6. The process continues for the desired number of iterations
7. Thus, the result becomes 83.205
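The chart procedure above can be sketched in software as follows. This is a minimal decimal model under stated assumptions, not the authors' circuit: the function name `flag_divide`, its parameters and the digit-by-digit start (which adds leading zero-quotient steps before the first step of the chart) are introduced here for illustration. The operands 38982 ÷ 73 and 135791 ÷ 1632 are inferred from the intermediate values quoted in Table 1. Where the traditional method would adjust a previous quotient digit to avoid a negative intermediate value, this sketch simply lets digits go out of range and normalises once at the end, which is a simplification.

```python
def flag_divide(dividend: int, divisor: int, flag_len: int):
    """Flag (Dhvajanka-style) division in base 10.

    Only the leading part of the divisor ("head") is used as a divisor;
    the last `flag_len` digits (the flag) enter through cross-products.
    Returns (quotient, remainder, steps), where each step is the tuple
    (gross dividend, cross-product, quotient digit, remainder).
    """
    head, tail = divmod(divisor, 10 ** flag_len)
    digits = [int(ch) for ch in str(dividend)]
    n_quot = len(digits) - flag_len          # number of quotient columns
    if head == 0 or n_quot < 1:
        raise ValueError("divisor/dividend too short for this flag length")
    flag = [int(ch) for ch in f"{tail:0{flag_len}d}"]

    def cross(j: int) -> int:
        # Cross-product of earlier quotient digits with the flag digits.
        return sum(q[j - i] * flag[i - 1]
                   for i in range(1, flag_len + 1) if 0 <= j - i < n_quot)

    q, steps, r = [], [], 0
    for j in range(n_quot):
        gross = r * 10 + digits[j]           # previous remainder + next digit
        c = cross(j)
        qj, r = divmod(gross - c, head)      # small division by the head only
        q.append(qj)
        steps.append((gross, c, qj, r))

    rem = r
    for j in range(n_quot, len(digits)):     # remainder columns
        rem = rem * 10 + digits[j] - cross(j)

    quotient = 0
    for qj in q:
        quotient = quotient * 10 + qj
    # Simplification: fold any out-of-range remainder back into the quotient.
    adjust, rem = divmod(rem, divisor)
    return quotient + adjust, rem, steps
```

For example, `flag_divide(38982, 73, 1)` reproduces the gross dividends 38, 39 and 38 and the quotient digits 5, 3 and 4 of the first example, and appending three zeros to the second dividend, `flag_divide(135791000, 1632, 2)`, yields the digits 83205 that correspond to 83.205.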
quotient and $21x^2 + 9x$ therefrom, and then obtain $3x^2 - x$ as the remainder.
3. However, this $3x^2$ is equal to $30x$, which (with $-x + 2$) gives us $29x + 2$ as the last-step dividend. Again multiplying the divisor by 4, we obtain the product $28x + 12$; subtracting this $28x + 12$ yields $x - 10$ as the remainder. However, $x$ being 10, the remainder vanishes.
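The algebraic argument can be replayed numerically. The sketch below assumes that the dividend polynomial is $38x^3 + 9x^2 + 8x + 2$ and the divisor is $7x + 3$ (the 38982 ÷ 73 example of Table 1, with $x = 10$); the helper names `sub_scaled` and `carry_down` are hypothetical and introduced only for this illustration. Polynomials are stored as coefficient lists, lowest degree first.

```python
X = 10  # the symbol x in the argument stands for the base, 10


def sub_scaled(poly, factor, shift, divisor):
    """Return poly - factor * x**shift * divisor (coefficient lists)."""
    out = list(poly)
    for k, coeff in enumerate(divisor):
        out[k + shift] -= factor * coeff
    return out


def carry_down(poly):
    """Rewrite the leading term c*x**d as (c*X)*x**(d-1), valid since x = X."""
    *low, top = poly
    low[-1] += top * X
    return low


divisor = [3, 7]          # 7x + 3
stage0 = [2, 8, 9, 38]    # 38x^3 + 9x^2 + 8x + 2

# Quotient term 5x^2: remainder 3x^3 - 6x^2 + 8x + 2, i.e. 24x^2 + 8x + 2.
stage1 = carry_down(sub_scaled(stage0, 5, 2, divisor))
# Quotient term 3x: subtract 21x^2 + 9x, leaving 3x^2 - x + 2.
after_3x = sub_scaled(stage1, 3, 1, divisor)
# 3x^2 = 30x, so the last-step dividend is 29x + 2.
stage2 = carry_down(after_3x)
# Quotient term 4: subtract 28x + 12, leaving x - 10.
final = sub_scaled(stage2, 4, 0, divisor)
remainder = final[0] + final[1] * X   # evaluates x - 10 at x = 10
```

Each intermediate list matches one line of the argument: `after_3x` is $3x^2 - x + 2$, `stage2` is $29x + 2$, and `final` is $x - 10$, which evaluates to zero at $x = 10$.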
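\[
\begin{aligned}
= \frac{1}{b_{m-1}x^{(n/2)-1} + b_{m-2}x^{(n/2)-2} + \cdots + b_0}\Biggl[\; & a_{n-1}x^{n-1} + \frac{a_{n-1}}{b_{m-1}}b_{m-2}x^{n-2} + \cdots + \frac{a_{n-1}}{b_{m-1}}b_2x^{(n/2)+2} + \frac{a_{n-1}}{b_{m-1}}b_1x^{(n/2)+1} + \frac{a_{n-1}}{b_{m-1}}b_0x^{n/2} \\
& + \left(a_{n-2} - \frac{a_{n-1}}{b_{m-1}}b_{m-2}\right)x^{n-2} + \cdots + \left(a_{(n/2)+2} - \frac{a_{n-1}}{b_{m-1}}b_2\right)x^{(n/2)+2} \\
& + \left(a_{(n/2)+1} - \frac{a_{n-1}}{b_{m-1}}b_1\right)x^{(n/2)+1} + \left(a_{n/2} - \frac{a_{n-1}}{b_{m-1}}b_0\right)x^{n/2} + \cdots + \left(a_0 - \frac{a_1 - \frac{a_2 - (\dots)}{b_{m-1}}b_0}{b_{m-1}}b_0\right)\Biggr]
\end{aligned}
\tag{2}
\]
\[
\frac{a_{n-1}}{b_{m-1}}x^{n/2} + \frac{a_{n-2} - \frac{a_{n-1}}{b_{m-1}}b_{m-2}}{b_{m-1}}x^{(n/2)-1} + \cdots + \left(a_0 - \frac{a_1 - \frac{a_2 - (\dots)}{b_{m-1}}b_0}{b_{m-1}}b_0\right)
\tag{3}
\]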
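\[
= \frac{\dfrac{a_3}{b_1}x^2(b_1x + b_0) + \dfrac{a_2 - \frac{a_3}{b_1}b_0}{b_1}\,x(b_1x + b_0) + \dfrac{a_1 - \frac{a_2 - \frac{a_3}{b_1}b_0}{b_1}b_0}{b_1}(b_1x + b_0) + \left(a_0 - \dfrac{a_1 - \frac{a_2 - \frac{a_3}{b_1}b_0}{b_1}b_0}{b_1}b_0\right)}{b_1x + b_0}
\tag{4}
\]
\[
= \frac{\left[\dfrac{a_3}{b_1}x^2 + \dfrac{a_2 - (a_3/b_1)b_0}{b_1}x + \dfrac{a_1 - \frac{a_2 - (a_3/b_1)b_0}{b_1}b_0}{b_1}\right](b_1x + b_0)}{b_1x + b_0} + \frac{a_0 - \dfrac{a_1 - \frac{a_2 - (a_3/b_1)b_0}{b_1}b_0}{b_1}b_0}{b_1x + b_0}
\tag{5}
\]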
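\[
= \frac{a_3}{b_1}x^2 + \frac{a_2 - \frac{a_3}{b_1}b_0}{b_1}x + \frac{a_1 - \frac{a_2 - \frac{a_3}{b_1}b_0}{b_1}b_0}{b_1} + \frac{a_0 - \dfrac{a_1 - \frac{a_2 - \frac{a_3}{b_1}b_0}{b_1}b_0}{b_1}b_0}{b_1x + b_0} = Q(x) + \frac{R}{g(x)}
\tag{6}
\]
where $Q(x)$ is the quotient, $g(x) = b_1x + b_0$ is the divisor and $R$ is the remainder. Through the algebraic identity the equations can be rewritten as
\[
Q(x) = \frac{a_3}{b_1}x^2 + \frac{a_2 - (a_3/b_1)b_0}{b_1}x + \frac{a_1 - \frac{a_2 - (a_3/b_1)b_0}{b_1}b_0}{b_1}
\]
and
\[
R = a_0 - \frac{a_1 - \frac{a_2 - (a_3/b_1)b_0}{b_1}b_0}{b_1}b_0
\]
Step 2: Determine $\sum_{i=n-(m/2)}^{n-1} a_i 2^i \Big/ \sum_{i=m/2}^{m-1} b_i 2^i$. If the first borrow is 0, the multiplexer sets the quotient bit ($Q_n$) to 1 and the remainder is $R$.
Step 3: Determine $Q_n \times b_{(m/2)-1}$. Concatenate $R$ and $a_{n-(m/2)-1}$ and subtract $Q_n \times b_{(m/2)-1}$. Again divide by the same procedure (step 1). Set the quotient bit $Q_{n-1}$ and the remainder $R$.
Step 4: Determine $Q_{n-1} \times b_{(m/2)-2} + Q_n \times b_{(m/2)-1}$. Concatenate $R$ and $a_{n-(m/2)-2}$ and subtract $Q_{n-1} \times b_{(m/2)-2} + Q_n \times b_{(m/2)-1}$. Again divide by the same procedure (step 1). Set the quotient bit $Q_{n-2}$ and the remainder $R$.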
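The closed forms for $Q(x)$ and $R$ given above, for a cubic dividend and a linear divisor, can be checked with exact rational arithmetic. The sketch below is illustrative only; the function names `cubic_by_linear` and `horner` are assumptions introduced here, and the coefficients follow the symbols $a_3, \dots, a_0$, $b_1$, $b_0$ of the text.

```python
from fractions import Fraction


def cubic_by_linear(a3, a2, a1, a0, b1, b0):
    """Quotient coefficients and remainder of
    (a3 x^3 + a2 x^2 + a1 x + a0) / (b1 x + b0),
    computed term by term as in the closed forms for Q(x) and R."""
    a3, a2, a1, a0, b1, b0 = map(Fraction, (a3, a2, a1, a0, b1, b0))
    q2 = a3 / b1                 # coefficient of x^2 in Q(x)
    q1 = (a2 - q2 * b0) / b1     # coefficient of x in Q(x)
    q0 = (a1 - q1 * b0) / b1     # constant term of Q(x)
    r = a0 - q0 * b0             # remainder R
    return (q2, q1, q0), r


def horner(coeffs, x):
    """Evaluate a polynomial given highest-degree-first coefficients."""
    value = Fraction(0)
    for c in coeffs:
        value = value * x + c
    return value


# Example: (x + 1)(2x^2 + 3x + 4) + 5 = 2x^3 + 5x^2 + 7x + 9.
quotient, remainder = cubic_by_linear(2, 5, 7, 9, 1, 1)
```

Each quotient coefficient reuses the one before it, which mirrors the way each step of the division reuses the previously obtained quotient digit in the cross-product.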
Fig. 3 Flowchart representation of divider using dhvajanka formula
partial product addition in the first stage. The second stage requires m/6 XOR gate delays and so on; thus the total delay for addition may be approximated as m + (m/2) = (3m/2) XOR gate delays. Similarly, for multiplication the maximum delay is approximated as 3m/2 XOR gate delays. In the fifth stage, an m/2 bit subtractor is required, so the critical path delay of the m/2 bit subtractor equals (m/2) × 3 XOR gate delays.
Thus, the total propagation delay for each of the iterations may be approximated as
\[
\begin{aligned}
t_{pd} &= t_{stage1} + t_{stage2} + t_{stage3} + t_{stage4} + t_{stage5} \\
&= (2m + 2) + (3m/2) + 2n + 3n + (3m/2) + (3m/2) = [5n + (13m/2) + 2]
\end{aligned}
\]
XOR gate delays. Therefore, n iterations may consume n(5n + (13m/2) + 2) XOR gate delays.

4 Results and discussion
The advantages of CMOS transmission gate (TG) logic over conventional CMOS and complementary pass transistor logic (CPL) [19, 20] are well established. As the CMOS TG consists of one p-channel MOSFET (PMOS) and one n-channel MOSFET (NMOS) connected in parallel, the ON resistance is smaller than that of even a single NMOS. Proper modifications at the device, circuit and architectural levels of the design hierarchy have been implemented to reduce the energy delay product (EDP) and power delay product (PDP) of the proposed design. TGs are used in the design of the different modules for faster operation and better logic transformation. A dual threshold voltage (VT) operating mode was considered in simulation to determine the performance parameters. The proper choice of threshold voltage for a particular transistor in the circuit is based on a number of considerations, as described below:
(i) Placement of high-VT transistors on the leakage path directly between supply and ground reduces the subthreshold leakage current and hence the static power.
Table 2 Illustration of the flowchart with the help of examples. Example 1 has been considered for complete division (remainder = 0), Example 2 has been considered for incomplete division (remainder ≠ 0)
Steps | Example 1 | Example 2
Fig. 4 Hardware implementation of divider using dhvajanka formula
Table 3 Performance parameters: propagation delay (ns), dynamic switching power consumption (mW), EDP (10^−24 J s), PDP (10^−12 J) and % savings in propagation delay and dynamic switching power consumption compared with the proposed methodology, as a function of the number of input bits. The architecture has been implemented through the spice spectre (T-Spice V13) simulator with 90 nm CMOS technology. For each transition, the delay is measured from 50% of the input voltage swing to 50% of the output voltage swing
Input no. of bits | Architectures | Delay, ns | Power, mW | EDP (10^−21), J s | PDP (10^−12), J | Improvement in delay, % | Improvement in power, %
were 34.2, 21.18 and 18.06%, respectively, compared with the same architectures. The layout of the proposed (32 ÷ 16) bit divider, shown in Fig. 6, was implemented using L-Edit (T-Spice V-13) and the corresponding area was found to be 17.39 mm².

5 Conclusions
A new division approach based on Vedic mathematics has been proposed for ultra-high-speed and low-power very large scale integration applications. The proposed approach was applied to (32 ÷ 16) division and was found to require minimum memory space of the processor compared with the conventional method. In addition, the proposed algorithm takes a minimum number of stages for the division, which significantly reduces the operational time.
In the division circuitry, the (32 ÷ 16) bit divider implementation was transformed into a small division instead of division by the actual divisor, together with subtraction and a few multiplications, thereby reducing the iterations and substantially reducing the propagation delay. The propagation delay for (32 ÷ 16) bit division was only 300 ns, whereas the power consumption was 32.53 mW for a layout area of 17.39 mm². Improvements in speed were found to be
47.3, 38.4 and 34% for the (32 ÷ 16) bit division circuitry, whereas the corresponding reductions in power consumption were 34.2, 21.18 and 18.06% compared with the DR-, NR- and GS-based implementations, respectively.

Fig. 6 Layout of the proposed (32 ÷ 16) bit Vedic divider. The layout was implemented through the L-Edit (T-Spice V-13) simulator; the corresponding area was found to be 17.39 mm²

6 References
[1] Juang T.-B., Chen S.-H.H., Li S.M.: 'A novel VLSI iterative divider architecture for fast quotient generation'. Proc. IEEE Int. Symp. Circuits and Systems 2011, Seattle, WA, USA, May 2008, pp. 3358–3361
[2] Oberman S.F., Flynn M.J.: 'Division algorithms and implementations', IEEE Trans. Comput., 1997, 46, (8), pp. 833–854
[3] Deschamps J.-P., Bioul G.J.A., Sutter G.D.: 'Synthesis of arithmetic circuits, FPGA, ASIC and embedded system' (John Wiley & Sons, Inc., 2006)
[4] Hagglund R., Lowenborg P., Vesterbacka M.: 'A polynomial-based division algorithm', Proc. IEEE Int. Symp. Circuits Syst., 2002, 3, pp. 571–574
[5] Aggarwal N., Asooja K., Verma S.S., Negi S.: 'An improvement in the restoring division algorithm (needy restoring division algorithm)'. Proc. IEEE Int. Conf. Computer Science and Information Technology, Beijing, August 2009, pp. 246–249
[6] Sutter G., Deschamps J.P.: 'High speed fixed point divider for FPGAS'. Proc. IEEE Int. Conf. Field Programmable Logic and Applications, Prague, August 2009, pp. 448–452
[7] Sutter G., Deschamps J.P.: 'Fast radix 2^k divider for FPGAs'. Proc. IEEE Int. Conf. Programmable Logic, Sao Carlos, April 2009, pp. 115–122
[8] Jun K., Swartzlander E.E.Jr.: 'Modified non-restoring division algorithm with improved delay profile and error correction'. Proc. IEEE Int. Conf. Signals System and Computer, 2012, pp. 1460–1464
[9] Liu W., Nannarelli A.: 'Power efficient division and square root unit', IEEE Trans. Comput., 2012, 61, (8), pp. 1059–1070
[10] Louvet N., Muller J.M., Panhaleux A.: 'Newton–Raphson algorithms for floating-point division using an FMA'. Proc. IEEE Int. Conf. Application Specific Systems Architectures and Processors, Rennes, France, July 2010, pp. 200–207
[11] Piso D., Bruguera J.D.: 'Simplifying the rounding for Newton–Raphson algorithm with parallel remainder'. Proc. IEEE Int. Conf. Signals Systems and Computers, Pacific Grove, CA, USA, November 2009, pp. 921–925
[12] Nenadic N.M., Mladenovic S.B.: 'Fast division on fixed-point DSP processors using Newton–Raphson method'. Proc. IEEE Int. Conf. Computers as a Tool, Belgrade, November 2005, pp. 705–708
[13] Guy E., Seidel P.-M.M., Warren E.F.Jr.: 'A parametric error analysis of Goldschmidt's division algorithm'. Proc. IEEE Int. Conf. Computer Arithmetic, June 2003, pp. 165–171
[14] Ercegovac M.D., Imbert L., Matula D.W., Muller J.-M., Wei G.: 'Improving Goldschmidt division, square root, and square root reciprocal', IEEE Trans. Comput., 2000, 49, (7), pp. 759–763
[15] Kong I., Swartzlander E.E.Jr: 'A rounding method to reduce the required multiplier precision for Goldschmidt division', IEEE Trans. Comput., 2010, 59, (12), pp. 1703–1708
[16] Maharaja J.S.S.B.K.T.: 'Vedic mathematics' (Motilal Banarsidass Publishers Pvt Ltd, Delhi, 2001)
[17] Saha P., Banerjee A., Bhattacharyya P., Dandapat A.: 'Vedic divider: novel architecture (ASIC) for high speed VLSI applications'. Proc. IEEE Int. Symp. System Design, Kochi, India, December 2011, pp. 67–71
[18] Saha P., Banerjee A., Dandapat A., Bhattacharyya P.: 'ASIC design of a high speed low power circuit for calculation of factorial of 4-bit numbers based on ancient vedic mathematics', Microelectron. J. (Elsevier), 2011, 42, (12), pp. 1343–1352
[19] Uyemura J.P.: 'CMOS logic circuit design' (Kluwer Academic Publishers, 2001)
[20] Chang C.H., Gu J., Zhang M.: 'Ultra low-voltage low-power CMOS 4-2 and 5-2 compressors for fast arithmetic circuits', IEEE Trans. Circuits Syst, I, 2004, 51, (10), pp. 1985–1997