

Mini Project Report
Submitted in Partial Fulfillment of the Requirements
for the Award of the Degree of







Under the Guidance of

Associate Professor
Department of ECE

Department of Electronics and Communication Engineering


(Approved by AICTE, Affiliated to JNTUH & Accredited by NBA)

2013-14


Estd. 1999

Shamshabad, Hyderabad 501218

Department of Electronics and Communication Engineering


This is to certify that the mini project report entitled "Bit-Serial Multiplier Using
Verilog HDL" was carried out by Mr. K. Bhargav, Roll Number 11885A0401, and Mr. P. Devsingh, Roll
Number 11885A0404, and submitted to the Department of Electronics and Communication
Engineering in partial fulfillment of the requirements for the award of the degree of Bachelor of
Technology in Electronics and Communication Engineering during the year 2013-2014.

Name & Signature of the Supervisor: Mr. S. Rajendar, Associate Professor

Name & Signature of the HOD: Dr. J. V. R. Ravindra, Head, ECE

Kacharam (V), Shamshabad (M), Ranga Reddy (Dist.) 501 218, Hyderabad, A.P.
Ph: 08413-253335, 253201, Fax: 08413-253482,

The satisfaction that accompanies the successful completion of a task would be
incomplete without mention of the people who made it possible, whose constant
guidance and encouragement crowned all the efforts with success.
I express my heartfelt thanks to Mr. S. Rajendar, Associate Professor, technical
seminar supervisor, for his suggestions in selecting and carrying out the in-depth study of
the topic. His valuable guidance, encouragement and critical reviews really helped to
shape this report to perfection.
I wish to express my deep sense of gratitude to Dr. J. V. R. Ravindra, Head of
the Department, for his able guidance and useful suggestions, which helped me in
completing the technical seminar on time.
I also owe my special thanks to our Director, Prof. L. V. N. Prasad, for his intense
support and encouragement and for having provided all the facilities.
Finally, thanks to all my family members and friends for their continuous support
and enthusiastic help.




Bit-serial arithmetic is attractive in view of its smaller pin count, reduced wire
length, and lower floor space requirement in VLSI. In fact, the compactness of the design
may allow us to run a bit-serial multiplier at a clock rate high enough to make the unit
almost competitive with much more complex designs with regard to speed. In addition, in
certain application contexts inputs are supplied bit-serially anyway. In such a case, using
a parallel multiplier would be quite wasteful, since the parallelism may not lead to any
speed benefit. Furthermore, in applications that call for a large number of independent
multiplications, multiple bit-serial multipliers may be more cost-effective than a complex
highly pipelined unit.
Bit-serial multipliers can be designed as systolic arrays: synchronous arrays of
processing elements that are interconnected by only short, local wires, thus allowing very
high clock rates. Let us begin by introducing a semi-systolic multiplier, so named because
its design involves broadcasting a single bit of the multiplier x to a number of circuit
elements, thus violating the "short, local wires" requirement of pure systolic design.







Contents

Chapter 1: Introduction
1.1 The Context of Computer Arithmetic
1.2 What is computer arithmetic
1.3 Multiplication
1.4 Organization of report

Chapter 2: VLSI
2.1 Introduction
2.2 What is VLSI?
2.2.1 History of Scale Integration
2.3 Advantages of ICs over discrete components
2.4 VLSI And Systems
2.5 Applications of VLSI
2.6 Conclusion

Chapter 3: Verilog HDL
3.1 Introduction
3.2 Major Capabilities
3.4 Conclusion

Chapter 4: Bit-serial multiplier
4.1 Multiplier
4.2 Background
4.2.1 Binary Multiplication
4.2.2 Hardware Multipliers
4.2.3 Array Multipliers
4.3 Variations in Multipliers
4.4 Bit-serial Multipliers

Chapter 5: Implementation
5.1 Tools Used
5.2 Coding Steps
5.3 Simulation steps
5.4 Full adder code
5.5 Full adder flowchart
5.6 Full adder testbench
5.7 Bit-serial multiplier algorithm
5.8 Bit-Serial multiplier code
5.9 Full adder waveform
5.10 Bit-serial multiplier testbench
5.11 Bit-serial multiplier waveforms

List of Figures

Mixed level modeling
Synthesis process
Typical design process
Basic Multiplication Data flow
Two Rows of an Array Multiplier
Data Flow through a Pipelined Array Multiplier
Bit-serial multiplier; 4x4 multiplication in 8 clock cycles
Bit Serial multiplier design in dot notation
Project directory structure
Simulation window
Waveform window
Full adder flowchart
Bit-Serial multiplier flowchart
Full adder output waveforms
Bit serial multiplier input/output waveforms
Bit serial multiplier with intermediate waveforms



1.1 The Context of Computer Arithmetic
Advances in computer architecture over the past two decades have allowed the
performance of digital computer hardware to continue its exponential growth, despite
increasing technological difficulty in speed improvement at the circuit level. This
phenomenal rate of growth, which is expected to continue in the near future, would not
have been possible without theoretical insights, experimental research, and tool-building
efforts that have helped transform computer architecture from an art into one of the most
quantitative branches of computer science and engineering. Better understanding of the
various forms of concurrency and the development of a reasonably efficient and user-friendly programming model have been key enablers of this success story.
The downside of exponentially rising processor performance is an unprecedented
increase in hardware and software complexity. The trend toward greater complexity is not
only at odds with testability and verifiability but also hampers adaptability, performance
tuning, and evaluation of the various trade-offs, all of which contribute to soaring
development costs. A key challenge facing current and future computer designers is to
reverse this trend by removing layer after layer of complexity, opting instead for clean,
robust, and easily certifiable designs, while continuing to try to devise novel methods for
gaining performance and ease-of-use benefits from simpler circuits that can be readily
adapted to application requirements.
In the computer designer's quest for user-friendliness, compactness, simplicity,
high performance, low cost, and low power, computer arithmetic plays a key role. It is
one of the oldest subfields of computer architecture. The bulk of hardware in early digital
computers resided in accumulator and other arithmetic/logic circuits. Thus, first-generation
computer designers were motivated to simplify and share hardware to the
extent possible and to carry out detailed cost-performance analyses before proposing a
design. Many of the ingenious design methods that we use today have their roots in the
bulky, power-hungry machines of 30-50 years ago.
In fact, computer arithmetic has been so successful that it has, at times, become
transparent. Arithmetic circuits are no longer dominant in terms of complexity; registers,
memory and memory management, instruction issue logic, and pipeline control have
become the dominant consumers of chip area in today's processors. Correctness and high
performance of arithmetic circuits are routinely expected, and episodes such as the Intel
Pentium division bug are indeed rare.

The preceding context is changing for several reasons. First, at very high clock
rates, the interfaces between arithmetic circuits and the rest of the processor become
critical. Arithmetic units can no longer be designed and verified in isolation. Rather, an
integrated design optimization is required, which makes the development even more
complex and costly. Second, optimizing arithmetic circuits to meet design goals by taking
advantage of the strengths of new technologies, and making them tolerant to the
weaknesses, requires a reexamination of existing design paradigms. Finally, incorporation
of higher-level arithmetic primitives into hardware makes the design, optimization, and
verification efforts highly complex and interrelated.
This is why computer arithmetic is alive and well today. Designers and
researchers in this area produce novel structures with amazing regularity. Carry-lookahead adders comprise a case in point. We used to think, in the not so distant past,
that we knew all there was to know about carry-lookahead fast adders. Yet, new designs,
improvements, and optimizations are still appearing. The ANSI/IEEE standard floating-point format has removed many of the concerns with compatibility and error control in
floating-point computations, thus resulting in new designs and products with mass-market
appeal. Given the arithmetic-intensive nature of many novel application areas (such as
encryption, error checking, and multimedia), computer arithmetic will continue to thrive
for years to come.

1.2 What is computer arithmetic

A sequence of events, begun in late 1994 and extending into 1995, embarrassed
the world's largest computer chip manufacturer and put the normally dry subject of
computer arithmetic on the front pages of major newspapers. The events were rooted in
the work of Thomas Nicely, a mathematician at Lynchburg College in Virginia, who
is interested in twin primes (consecutive odd numbers such as 29 and 31 that are both
prime). Nicely's work involves the distribution of twin primes and, particularly, the sum
of their reciprocals S = 1/5 + 1/7 + 1/11 + 1/13 + 1/17 + 1/19 + 1/29 + 1/31 + ... + 1/p + 1/(p + 2) + ... . While it is known that the infinite sum S has a finite value, no one knows what the
value is.
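The partial sums of this series are straightforward to compute, which gives a feel for the kind of calculation involved. The following is a minimal Python sketch, not code from Nicely's work; the function names `primes_up_to` and `twin_prime_reciprocal_sum` are hypothetical, and the sum starts at the pair (5, 7) because that is how the series is written above.

```python
from fractions import Fraction


def primes_up_to(limit):
    """Return all primes <= limit using a simple sieve of Eratosthenes."""
    if limit < 2:
        return []
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for n in range(2, int(limit ** 0.5) + 1):
        if is_prime[n]:
            for multiple in range(n * n, limit + 1, n):
                is_prime[multiple] = False
    return [n for n, flag in enumerate(is_prime) if flag]


def twin_prime_reciprocal_sum(limit):
    """Sum 1/p + 1/(p+2) over twin primes (p, p+2) with p >= 5 and p+2 <= limit.

    Exact rational arithmetic is used so that rounding does not affect the result.
    """
    primes = primes_up_to(limit)
    prime_set = set(primes)
    total = Fraction(0)
    for p in primes:
        if p >= 5 and (p + 2) in prime_set:
            total += Fraction(1, p) + Fraction(1, p + 2)
    return total


# Partial sum over the pairs (5, 7), (11, 13), (17, 19), (29, 31)
partial = twin_prime_reciprocal_sum(31)
print(float(partial))
```

Every term requires a reciprocal, so a flaw in the division hardware shows up directly in the partial sums.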
Nicely was using several different computers for his work and in March 1994
added a machine based on the Intel Pentium processor to his collection. Soon he began
noticing inconsistencies in his calculations and was able to trace them back to the values
computed for 1 / p and 1 / (p + 2) on the Pentium processor. At first, he suspected his own
programs, the compiler, and the operating system, but by October, he became convinced
that the Intel Pentium chip was at fault. This suspicion was confirmed by several other
researchers following a barrage of e-mail exchanges and postings on the Internet. The
diagnosis finally came from Tim Coe, an engineer at Vitesse Semiconductor. Coe built a
model of the Pentium's floating-point division hardware based on the radix-4 SRT algorithm
and came up with an example that produces the worst-case error. Using double-precision
floating-point computation, the ratio c = 4 195 835 / 3 145 727 = 1.333 820 44... is
computed as 1.333 739 06 on the Pentium. This latter result is accurate to only 14 bits;
the error is even larger than that of single-precision floating-point and more than 10
orders of magnitude worse than what is expected of double-precision computation.
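The "14 bits" figure can be checked from the two values quoted above. The short Python sketch below is illustrative only: the flawed quotient is the rounded value given in the text, and the single- and double-precision bounds are the usual relative rounding errors for 24-bit and 53-bit significands. The variable names are hypothetical.

```python
import math

exact = 4195835 / 3145727   # correctly rounded double-precision quotient
flawed = 1.33373906         # value the text reports for the flawed Pentium

relative_error = abs(exact - flawed) / exact
accurate_bits = -math.log2(relative_error)

single_precision_bound = 2.0 ** -24   # relative rounding error, 24-bit significand
double_precision_bound = 2.0 ** -53   # relative rounding error, 53-bit significand

print(f"relative error          : {relative_error:.3e}")
print(f"accurate bits           : {accurate_bits:.1f}")
print(f"worse than single bound : {relative_error > single_precision_bound}")
print(f"ratio to double bound   : {relative_error / double_precision_bound:.1e}")
```

The relative error comes out close to 2^-14, which matches the statement that the result is accurate to only about 14 bits.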
The rest, as they say, is history. Intel at first dismissed the severity of the problem
and admitted only a subtle flaw, with a probability of 1 in 9 billion, or once in 27,000
years for the average spreadsheet user, of leading to computational errors. It nevertheless
published a white paper that described the bug and its potential consequences and
announced a replacement policy for the defective chips based on customer need; that is,
customers had to show that they were doing a lot of mathematical calculations to get a
free replacement. Under heavy criticism from customers, manufacturers using the
Pentium chip in their products, and the on-line community, Intel later revised its policy to
no-questions-asked replacement.
Whereas supercomputing, microchips, computer networks, advanced applications
(particularly chess-playing programs), and many other aspects of computer technology
have made the news regularly in recent years, the Intel Pentium bug was the first instance
of arithmetic (or anything inside the CPU for that matter) becoming front-page news.
While this can be interpreted as a sign of pedantic dryness, it is more likely an indicator
of stunning technological success. Glaring software failures have come to be routine
events in our information-based society, but hardware bugs are rare and newsworthy.
Within the hardware realm, we will be dealing with both general-purpose
arithmetic/logic units (ALUs), of the type found in many commercially available
processors, and special-purpose structures for solving specific application problems. The
differences in the two areas are minor as far as the arithmetic algorithms are concerned.
However, in view of the specific technological constraints, production volumes, and
performance criteria, hardware implementations tend to be quite different. General-purpose processor chips that are mass-produced have highly optimized custom designs.
Implementations of low-volume, special-purpose systems, on the other hand, typically
rely on semicustom and off-the-shelf components. However, when critical and strict
requirements, such as extreme speed, very low power consumption, and miniature size,
preclude the use of semicustom or off-the-shelf components, the much higher cost of a
custom design may be justified even for a special-purpose system.

1.3 Multiplication
Multiplication (often denoted by the cross symbol "×", or by the absence of a
symbol) is the third basic mathematical operation of arithmetic, the others being addition,
subtraction and division (division is the fourth one, because it requires multiplication
to be defined). The multiplication of two whole numbers is equivalent to the addition of
one of them with itself as many times as the value of the other one; for example, 3
multiplied by 4 (often said as "3 times 4") can be calculated by adding 4 copies of 3
together: 3 × 4 = 3 + 3 + 3 + 3 = 12. Here 3 and 4 are the "factors" and 12 is the
"product". One of the main properties of multiplication is that the result does not depend
on which factor is repeatedly added (commutative property): 3
multiplied by 4 can also be calculated by adding 3 copies of 4 together: 3 × 4 = 4 + 4
+ 4 = 12. The multiplication of integers (including negative numbers), rational numbers
(fractions) and real numbers is defined by a systematic generalization of this basic
definition. Multiplication can also be visualized as counting objects arranged in a
rectangle (for whole numbers) or as finding the area of a rectangle whose sides have
given lengths. The area of a rectangle does not depend on which side is measured first,
which illustrates the commutative property. In general, multiplying two measurements
gives a new type, depending on the measurements. For instance, 2.5 meters × 4.5
meters = 11.25 square meters, and 11 meters/second × 9 seconds = 99 meters. The inverse
operation of multiplication is division. For example, since 4 multiplied by 3 equals
12, 12 divided by 3 equals 4. Multiplication by 3, followed by division by 3, yields
the original number (since the division of a number other than 0 by itself equals 1).
Multiplication is also defined for other types of numbers, such as complex numbers, and
more abstract constructs, like matrices. For these more abstract constructs, the order in which
the operands are multiplied sometimes does matter.
Multiplication, often realized by k cycles of shifting and adding, is a heavily used
operation in signal processing applications. In this part, after examining shift/add multiplication schemes and their
various implementations, we note that there are but two ways to speed up the underlying
multioperand addition: reducing the number of operands to be added leads to high-radix
multipliers, and devising hardware multioperand adders that minimize the latency and/or
maximize the throughput leads to tree and array multipliers. Of course, speed is not the
only criterion of interest. Cost, VLSI area, and pin limitations favor bit-serial designs,
while the desire to use available building blocks leads to designs based on additive
multiply modules. Finally, the special case of squaring is of interest as it leads to
considerable simplification.
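The phrase "k cycles of shifting and adding" can be made concrete with a small software model. The sketch below is illustrative Python, not the Verilog design developed later in this report; the function name `shift_add_multiply` and its parameters are hypothetical. Each loop iteration stands for one cycle: it examines one bit of the multiplier and, if that bit is 1, adds a shifted copy of the multiplicand to the running sum.

```python
def shift_add_multiply(a, x, k):
    """Multiply unsigned k-bit integers a (multiplicand) and x (multiplier).

    One loop iteration models one cycle: bit `cycle` of x selects whether the
    multiplicand, shifted left by `cycle` positions, is added to the product.
    """
    limit = 1 << k
    if not (0 <= a < limit and 0 <= x < limit):
        raise ValueError("operands must be unsigned k-bit values")
    product = 0
    for cycle in range(k):
        if (x >> cycle) & 1:          # current multiplier bit
            product += a << cycle     # add the aligned partial product
    return product


# 4-bit example: 10 * 11 takes 4 cycles
print(shift_add_multiply(0b1010, 0b1011, 4))
```

The number of cycles equals the number of multiplier bits, which is why reducing the number of operands (higher radix) or adding them in parallel (tree and array multipliers) are the two routes to higher speed.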

1.4 Organization of report

This report starts with an introduction to computer arithmetic and then introduces
multiplication. It then explains the implementation of one type of multiplier, the bit-serial multiplier.
Chapter 1: Introduction. This chapter explains the importance of computer arithmetic and
multiplication in computations.
Chapter 2: VLSI. This chapter focuses on VLSI and its evolution, as well as its applications
and advantages.
Chapter 3: Verilog HDL. This chapter explains how HDLs shorten the design cycle in VLSI
and how automation makes implementation faster.
Chapter 4: Bit-serial multiplier. This chapter explains multipliers and their types and
how the bit-serial multiplier is useful.
Chapter 5: Implementation. This chapter explains the implementation flow of the bit-serial
multiplier, its Verilog code and its output waveforms.
Chapter 6: Conclusions. This chapter summarizes the bit-serial multiplier and its future

2.1 Introduction

Very-large-scale integration (VLSI) is the process of creating integrated
circuits by combining thousands of transistor-based circuits into a single chip. VLSI
began in the 1970s when complex semiconductor and communication technologies
were being developed. The microprocessor is a VLSI device. The term is no longer as
common as it once was, as chips have increased in complexity into the hundreds of
millions of transistors.
The first semiconductor chips held one transistor each. Subsequent advances
added more and more transistors, and, as a consequence, more individual functions or
systems were integrated over time. The first integrated circuits held only a few
devices, perhaps as many as ten diodes, transistors, resistors and capacitors, making it
possible to fabricate one or more logic gates on a single device. This level is now known
retrospectively as "small-scale integration" (SSI). Improvements in technique led to
devices with hundreds of logic gates, known as medium-scale integration (MSI), and then
to large-scale integration (LSI), that is,
systems with at least a thousand logic gates. Current technology has moved far past
this mark and today's microprocessors have many millions of gates and hundreds of
millions of individual transistors.
At one time, there was an effort to name and calibrate various levels of large-scale integration above VLSI. Terms like Ultra-large-scale Integration (ULSI) were
used. But the huge number of gates and transistors available on common devices has
rendered such fine distinctions moot. Terms suggesting greater than VLSI levels of
integration are no longer in widespread use. Even VLSI is now somewhat quaint,
given the common assumption that all microprocessors are VLSI or better.
As of early 2008, billion-transistor processors are commercially available, an
example of which is Intel's Montecito Itanium chip. This is expected to become more
commonplace as semiconductor fabrication moves from the current generation of 65 nm
processes to the next 45 nm generations (while experiencing new challenges such as
increased variation across process corners).
Another such processor is notable in that its 1.4 billion transistor count,
capable of a teraflop of performance, is almost entirely dedicated to logic (Itanium's
transistor count, by contrast, is largely due to its 24 MB L3 cache). Current designs, as opposed to
the earliest devices, use extensive design automation and automated logic synthesis to
lay out the transistors, enabling higher levels of complexity in the resulting logic
functionality. Certain high-performance logic blocks like the SRAM cell, however, are
still designed by hand to ensure the highest efficiency (sometimes by bending or
breaking established design rules to obtain the last bit of performance by trading

2.2 What is VLSI?

VLSI stands for "Very Large Scale Integration". This is the field which involves
packing more and more logic devices into smaller and smaller areas.

Put simply, an integrated circuit is many transistors on one chip.

VLSI involves building extremely complex circuitry using semiconductor material.

An integrated circuit (IC) may contain millions of transistors, each a few micrometers in size.

Applications are wide ranging: most electronic logic devices.

2.2.1 History of Scale Integration

late 40s: Transistor invented at Bell Labs

late 50s: First IC (Jack Kilby at TI)

early 60s: Small Scale Integration (SSI)
o 10s of transistors on a chip

late 60s: Medium Scale Integration (MSI)
o 100s of transistors on a chip

early 70s: Large Scale Integration (LSI)
o 1000s of transistors on a chip

early 80s: VLSI
o 10,000s of transistors on a chip (later 100,000s and now 1,000,000s)

Ultra LSI is sometimes used for 1,000,000s

2.3 Advantages of ICs over discrete components

While we will concentrate on integrated circuits, the properties of integrated
circuits (what we can and cannot efficiently put in an integrated circuit) largely
determine the architecture of the entire system. Integrated circuits improve system
characteristics in several critical ways. ICs have three key advantages over digital
circuits built from discrete components:
Size: Integrated circuits are much smaller: both transistors and wires are shrunk to
micrometer sizes, compared to the millimeter or centimeter scales of discrete
components. Small size leads to advantages in speed and power consumption, since
smaller components have smaller parasitic resistances, capacitances, and inductances.
Speed: Signals can be switched between logic 0 and logic 1 much quicker within a chip
than they can between chips. Communication within a chip can occur hundreds of
times faster than communication between chips on a printed circuit board. The high
speed of circuits on-chip is due to their small size: smaller components and wires have
smaller parasitic capacitances to slow down the signal.
Power consumption: Logic operations within a chip also take much less power. Once
again, lower power consumption is largely due to the small size of circuits on the chip: smaller parasitic capacitances and resistances require less power to drive them.
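The link between parasitic capacitance, speed and power can be illustrated with two standard first-order formulas: the 50% delay of an RC node, about 0.69·R·C, and the dynamic switching power, activity·C·Vdd²·f. The Python sketch below is only an illustration under stated assumptions: all component values are hypothetical round numbers chosen for the example, not measurements, and the function names are not from the report.

```python
def rc_delay(resistance_ohm, capacitance_f):
    """Time for an RC node to reach 50% of its final value (about 0.69 * R * C)."""
    return 0.69 * resistance_ohm * capacitance_f


def switching_power(capacitance_f, vdd, frequency_hz, activity=1.0):
    """First-order dynamic power: activity * C * Vdd^2 * f."""
    return activity * capacitance_f * vdd ** 2 * frequency_hz


# Hypothetical values for illustration only (assumptions, not measurements)
DRIVER_RESISTANCE = 1_000.0   # ohms, same driver assumed in both cases
ON_CHIP_WIRE_CAP = 10e-15     # farads, a short on-chip wire
BOARD_TRACE_CAP = 10e-12      # farads, a printed-circuit-board trace plus pins
VDD = 1.0                     # volts
CLOCK = 100e6                 # hertz

for name, cap in (("on-chip wire", ON_CHIP_WIRE_CAP), ("board trace", BOARD_TRACE_CAP)):
    delay = rc_delay(DRIVER_RESISTANCE, cap)
    power = switching_power(cap, VDD, CLOCK)
    print(f"{name:12s} delay = {delay:.2e} s, power = {power:.2e} W")
```

With the same driver, both delay and power scale linearly with the capacitance being driven, which is the sense in which smaller parasitics make on-chip signals faster and cheaper to switch.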

2.4 VLSI And Systems

These advantages of integrated circuits translate into advantages at the system
level:
Smaller physical size: Smallness is often an advantage in itself; consider portable
televisions or handheld cellular telephones.
Lower power consumption: Replacing a handful of standard parts with a single chip
reduces total power consumption. Reducing power consumption has a ripple effect on
the rest of the system: a smaller, cheaper power supply can be used; since less power
consumption means less heat, a fan may no longer be necessary; a simpler cabinet with
less electromagnetic shielding may be feasible, too.
Reduced cost: Reducing the number of components, the power supply requirements,
cabinet costs, and so on, will inevitably reduce system cost. The ripple effect of
integration is such that the cost of a system built from custom ICs can be less, even
though the individual ICs cost more than the standard parts they replace.

Understanding why integrated circuit technology has such profound influence
on the design of digital systems requires understanding both the technology of IC
manufacturing and the economics of ICs and digital systems.

2.5 Applications of VLSI

Electronic systems now perform a wide variety of tasks in daily life. Electronic
systems in some cases have replaced mechanisms that operated mechanically,
hydraulically, or by other means; electronics are usually smaller, more flexible, and
easier to service. In other cases electronic systems have created totally new applications.
Electronic systems perform a variety of tasks, some of them visible, some more hidden:

Electronic systems in cars operate stereo systems and displays; they also control fuel
injection systems, adjust suspensions to varying terrain, and perform the control
functions required for anti-lock braking (ABS) systems.

Digital electronics compress and decompress video, even at high-definition data
rates, on-the-fly in consumer electronics.

Low-cost terminals for Web browsing still require sophisticated electronics,
despite their dedicated function.

Personal computers and workstations provide word-processing, financial analysis,
and games. Computers include both central processing units (CPUs) and special-purpose
hardware for disk access, faster screen display, etc.

Medical electronic systems measure bodily functions and perform complex
processing algorithms to warn about unusual conditions.

The availability of these complex systems, far from overwhelming consumers, only
creates demand for even more complex systems.

2.6 Conclusion
The growing sophistication of applications continually pushes the design and
manufacturing of integrated circuits and electronic systems to new levels of complexity.
And perhaps the most amazing characteristic of this collection of systems is its variety: as systems become more complex, we build not a few general-purpose computers but
an ever wider range of special-purpose systems. Our ability to do so is a testament to
our growing mastery of both integrated circuit manufacturing and design, but the
demands of customers continue to test the limits of design.


3.1 Introduction
Verilog HDL is a hardware description language that can be used to model a
digital system at many levels of abstraction ranging from the algorithmic-level to the
gate-level to the switch-level. The complexity of the digital system being modeled
could vary from that of a simple gate to a complete electronic digital system, or
anything in between. The digital system can be described hierarchically and timing
can be explicitly modeled within the same description.
The Verilog HDL language includes capabilities to describe the behavioral
nature of a design, the dataflow nature of a design, a design's structural composition,
delays and a waveform generation mechanism including aspects of response monitoring
and verification, all modeled using one single language. In addition, the language
provides a programming language interface through which the internals of a design can
be accessed during simulation including the control of a simulation run.
The language not only defines the syntax but also defines very clear simulation
semantics for each language construct. Therefore, models written in this language
can be verified using a Verilog simulator. The language inherits many of its operator
symbols and constructs from the C programming language. Verilog HDL provides an
extensive range of modeling capabilities, some of which are quite difficult to
comprehend initially. However, a core subset of the language is quite easy to learn and
use. This is sufficient to model most applications.
The Verilog HDL language was first developed by Gateway Design Automation
in 1983 as a hardware modeling language for their simulator product. At that time, it
was a proprietary language.
Because of the popularity of the simulator product, Verilog HDL gained acceptance as a
usable and practical language by a number of designers. In an effort to increase the
popularity of the language, the language was placed in the public domain in 1990.
Open Verilog International (OVI) was formed to promote Verilog. In 1992 OVI
decided to pursue standardization of Verilog HDL as an IEEE standard. This effort was
successful and the language became an IEEE standard in 1995. The complete standard is
described in the Verilog Hardware Description Language Reference Manual. The standard
is called IEEE Std 1364-1995.

3.2 Major Capabilities

Listed below are the major capabilities of the Verilog hardware description language:

Primitive logic gates, such as and, or and nand, are built into the language.

Flexibility of creating a user-defined primitive (UDP). Such a primitive could
either be a combinational logic primitive or a sequential logic primitive.

Switch-level modeling primitive gates, such as pmos and nmos, are also built
into the language.

A design can be modeled in three different styles or in a mixed style. These
styles are: behavioral style, modeled using procedural constructs; dataflow style,
modeled using continuous assignments; and structural style, modeled using
gate and module instantiations.

There are two data types in Verilog HDL: the net data type and the register
data type. The net type represents a physical connection between structural
elements while a register type represents an abstract data storage element.

Figure 3.1 shows the mixed-level modeling capability of Verilog HDL, that is, in
one design, each module may be modeled at a different level.

Figure 3.1 Mixed level modeling

Verilog HDL also has built-in logic functions such as & (bitwise-and) and |
(bitwise-or).

High-level programming language constructs such as conditionals, case
statements, and loops are available in the language.

Notion of concurrency and time can be explicitly modeled.

Powerful file read and write capabilities are provided.

The language is non-deterministic under certain situations, that is, a model may
produce different results on different simulators; for example, the ordering of
events on an event queue is not defined by the standard.
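The difference between the three modeling styles listed above can be seen on a one-bit full adder. The sketch below is a Python analogy, not Verilog: structural style is imitated by wiring together small gate functions, dataflow style by Boolean expressions (the counterpart of continuous assignments), and behavioral style by ordinary arithmetic in a procedure. All function names are hypothetical.

```python
from itertools import product


# Structural style analogue: primitive "gates" instantiated and wired together.
def xor_gate(a, b):
    return a ^ b


def and_gate(a, b):
    return a & b


def or_gate(a, b):
    return a | b


def full_adder_structural(a, b, cin):
    s1 = xor_gate(a, b)        # first half adder: sum
    c1 = and_gate(a, b)        # first half adder: carry
    total = xor_gate(s1, cin)  # second half adder: sum
    c2 = and_gate(s1, cin)     # second half adder: carry
    return total, or_gate(c1, c2)


# Dataflow style analogue: outputs written directly as Boolean expressions.
def full_adder_dataflow(a, b, cin):
    total = a ^ b ^ cin
    cout = (a & b) | (b & cin) | (a & cin)
    return total, cout


# Behavioral style analogue: describe what is computed, not how it is wired.
def full_adder_behavioral(a, b, cin):
    value = a + b + cin
    return value % 2, value // 2


# All three descriptions agree on every input combination.
for a, b, cin in product((0, 1), repeat=3):
    results = {
        full_adder_structural(a, b, cin),
        full_adder_dataflow(a, b, cin),
        full_adder_behavioral(a, b, cin),
    }
    print(a, b, cin, results)
```

In a mixed-style design, different modules of one system could each use whichever of these descriptions is most convenient.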

Synthesis is the process of constructing a gate-level netlist from a register-transfer level model of a circuit described in Verilog HDL. Figure 3.2 shows such a
process. A synthesis system may, as an intermediate step, generate a netlist that is
comprised of register-transfer level blocks such as flip-flops, arithmetic-logic-units,
and multiplexers, interconnected by wires. In such a case, a second program called the
RTL module builder is necessary. The purpose of this builder is to build, or acquire
from a library of predefined components, each of the required RTL blocks in the user-specified target technology.

Figure 3.2 Synthesis process

The above figure shows the basic elements of Verilog HDL and the elements
used in hardware. A mapping mechanism or a construction mechanism has to be
provided that translates the Verilog HDL elements into their corresponding hardware
elements as shown in Figure 3.3.

3.4 Conclusion
The Verilog HDL language includes capabilities to describe the behavioral
nature of a design, the dataflow nature of a design, a design's structural composition,
delays and a waveform generation mechanism including aspects of response monitoring
and verification, all modeled using one single language. The language not only defines
the syntax but also defines very clear simulation semantics for each language construct.
Therefore, models written in this language can be verified using a Verilog simulator.

Figure 3.3: Typical design process


4.1 Multiplier
Multipliers are key components of many high performance systems such as FIR
filters, microprocessors, digital signal processors, etc. A system's performance is
generally determined by the performance of the multiplier because the multiplier is
generally the slowest element in the system. Furthermore, it is generally the most area
consuming. Hence, optimizing the speed and area of the multiplier is a major design
issue. However, area and speed are usually conflicting constraints, so that improving
speed mostly results in larger area. As a result, a whole spectrum of multipliers with
different area-speed constraints has been designed, with fully parallel multipliers at one
end and fully serial multipliers at the other. In between
are digit-serial multipliers, where single digits consisting of several bits are operated on.
These multipliers have moderate performance in both speed and area. However, existing
digit-serial multipliers have been plagued by complicated switching systems and/or
irregularities in design. Radix 2^n multipliers, which operate on digits in a parallel fashion
instead of bits, bring the pipelining to the digit level and avoid most of the above
problems. They were introduced by M. K. Ibrahim in 1993. These structures are iterative
and modular. The pipelining done at the digit level brings the benefit of constant
operation speed irrespective of the size of the multiplier. The clock speed is only
determined by the digit size, which is already fixed before the design is implemented.
The growing market for fast floating-point co-processors, digital signal processing
chips, and graphics processors has created a demand for high speed, area-efficient
multipliers. Current architectures range from small, low-performance shift and add
multipliers, to large, high performance array and tree multipliers. Conventional linear
array multipliers achieve high performance in a regular structure, but require large
amounts of silicon. Tree structures achieve even higher performance than linear arrays
but the tree interconnection is more complex and less regular, making them even larger
than linear arrays. Ideally, one would want the speed benefits of a tree structure, the
regularity of an array multiplier, and the small size of a shift and add multiplier.

4.2 Background
Webster's dictionary defines multiplication as "a mathematical operation that at
its simplest is an abbreviated process of adding an integer to itself a specified number of
times". A number (multiplicand) is added to itself a number of times as specified by
another number (multiplier) to form a result (product). In elementary school, students
learn to multiply by placing the multiplicand on top of the multiplier. The multiplicand is
then multiplied by each digit of the multiplier, beginning with the rightmost, Least
Significant Digit (LSD). Intermediate results (partial-products) are placed one atop the
other, offset by one digit to align digits of the same weight. The final product is
determined by summation of all the partial-products. Although most people think of
multiplication only in base 10, this technique applies equally to any base, including
binary. Figure 4.1 shows the data flow for the basic multiplication technique just
described. Each black dot represents a single digit.

Figure 4.1: Basic Multiplication Data flow

4.2.1 Binary Multiplication
In the binary number system the digits, called bits, are limited to the set {0, 1}. The
result of multiplying any binary number by a single binary bit is either 0 or the original
number. This makes forming the intermediate partial-products simple and efficient.
Summing these partial-products is the time-consuming task for binary multipliers. One
logical approach is to form the partial-products one at a time and sum them as they are
generated. Often implemented by software on processors that do not have a hardware
multiplier, this technique works fine, but is slow because at least one machine cycle is
required to sum each additional partial-product.
For applications where this approach does not provide enough performance,
multipliers can be implemented directly in hardware.
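The one-partial-product-per-step approach described above can also be sketched as a small sequential circuit. The following is a minimal Verilog sketch, not code from this project: the module name shift_add_mult, the 4-bit operand width, and the start/done handshake are assumptions chosen only for illustration.

```verilog
// Hypothetical sequential shift-and-add multiplier (illustrative names and widths).
// One partial product (0 or the shifted multiplicand) is added per clock cycle.
module shift_add_mult (
    input            clk,
    input            rst,      // synchronous reset, active high
    input            start,    // pulse high for one cycle to load the operands
    input      [3:0] a,        // multiplicand
    input      [3:0] x,        // multiplier
    output reg [7:0] product,
    output reg       done
);
    reg [7:0] a_sh;   // multiplicand, shifted left once per cycle
    reg [3:0] x_sh;   // multiplier, shifted right once per cycle
    reg [2:0] count;  // number of multiplier bits still to process
    reg       busy;

    always @(posedge clk) begin
        if (rst) begin
            product <= 8'd0; done <= 1'b0; busy <= 1'b0;
            a_sh <= 8'd0; x_sh <= 4'd0; count <= 3'd0;
        end else if (start) begin
            product <= 8'd0;
            a_sh    <= {4'd0, a};
            x_sh    <= x;
            count   <= 3'd4;
            busy    <= 1'b1;
            done    <= 1'b0;
        end else if (busy) begin
            if (x_sh[0]) product <= product + a_sh;  // add this partial product
            a_sh  <= a_sh << 1;
            x_sh  <= x_sh >> 1;
            count <= count - 3'd1;
            if (count == 3'd1) begin                 // last multiplier bit
                busy <= 1'b0;
                done <= 1'b1;
            end
        end
    end
endmodule

// Small testbench: multiplies 13 by 11 and prints the result.
module shift_add_mult_tb;
    reg clk = 0, rst = 1, start = 0;
    reg  [3:0] a, x;
    wire [7:0] product;
    wire done;
    shift_add_mult dut (clk, rst, start, a, x, product, done);
    always #5 clk = ~clk;
    initial begin
        @(posedge clk);                 // reset is sampled on this edge
        @(negedge clk) rst = 0;
        a = 4'd13; x = 4'd11; start = 1;
        @(negedge clk) start = 0;
        wait (done);
        $display("%0d * %0d = %0d", a, x, product);
        $finish;
    end
endmodule
```

Each multiplication in this sketch takes one clock cycle per multiplier bit, which mirrors the one-machine-cycle-per-partial-product cost noted above.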
4.2.2 Hardware Multipliers
Direct hardware implementations of shift-and-add multipliers can increase
performance over software implementations, but are still quite slow. The reason is that as
each additional partial-product is summed, a carry must be propagated from the least
significant bit (LSB) to the most significant bit (MSB). This carry propagation is
time-consuming, and must be repeated for each partial product to be summed.

One method to increase multiplier performance is to use encoding techniques to
reduce the number of partial products to be summed. Just such a technique was first
proposed by Booth. The original Booth's algorithm skips over contiguous strings of 1s by
using the property that 2^n + 2^(n-1) + 2^(n-2) + ... + 2^(n-m) = 2^(n+1) - 2^(n-m).
Although Booth's algorithm produces at most N/2 encoded partial products from an N-bit
operand, the number of partial products produced varies. This has caused designers to use
modified versions of Booth's algorithm for hardware multipliers. Modified 2-bit Booth
encoding halves the number of partial products to be summed.
Since the resulting encoded partial-products can then be summed using any
suitable method, modified 2-bit Booth encoding is used on most modern floating-point
chips [LU 88], [MCA 86]. A few designers have even turned to modified 3-bit Booth
encoding, which reduces the number of partial products to be summed by a factor of three
[BEN 89]. The problem with 3-bit encoding is that the carry-propagate addition required
to form the 3X multiples often overshadows the potential gains of 3-bit Booth encoding.
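To make the modified 2-bit (radix-4) Booth encoding concrete, the sketch below recodes a signed 8-bit multiplier into four digits from the set {-2, -1, 0, +1, +2} and sums the four resulting partial products. It is an illustrative Verilog sketch under stated assumptions, not code from this project: the module name booth_radix4_mult, the 8-bit width, and the behavioral loop (which leaves the adder structure to the synthesis tool) are all choices made for the example.

```verilog
// Hypothetical radix-4 (modified 2-bit) Booth multiplier for signed 8-bit operands.
module booth_radix4_mult (
    input  signed [7:0]      a,        // multiplicand
    input  signed [7:0]      x,        // multiplier
    output reg signed [15:0] product
);
    reg        [8:0]  x_ext;  // multiplier with an implicit 0 appended on the right
    reg signed [15:0] a_ext;  // sign-extended multiplicand
    reg signed [15:0] pp;     // one encoded partial product
    integer i;

    always @* begin
        x_ext   = {x, 1'b0};
        a_ext   = a;
        product = 16'sd0;
        // Four partial products instead of eight
        for (i = 0; i < 4; i = i + 1) begin
            // Overlapping 3-bit window: x[2i+1], x[2i], x[2i-1]
            case ({x_ext[2*i+2], x_ext[2*i+1], x_ext[2*i]})
                3'b000, 3'b111: pp = 16'sd0;          //  0
                3'b001, 3'b010: pp = a_ext;           // +1 * a
                3'b011:         pp = a_ext <<< 1;     // +2 * a
                3'b100:         pp = -(a_ext <<< 1);  // -2 * a
                default:        pp = -a_ext;          // 3'b101, 3'b110: -1 * a
            endcase
            product = product + (pp <<< (2*i));       // digit i has weight 4^i
        end
    end
endmodule

// Exhaustive self-check against the built-in multiplication operator.
module booth_radix4_mult_tb;
    reg  signed [7:0]  a, x;
    wire signed [15:0] product;
    integer i, j, errors;
    booth_radix4_mult dut (a, x, product);
    initial begin
        errors = 0;
        for (i = -128; i < 128; i = i + 1)
            for (j = -128; j < 128; j = j + 1) begin
                a = i; x = j; #1;
                if (product !== i * j) errors = errors + 1;
            end
        $display("mismatches: %0d", errors);
        $finish;
    end
endmodule
```

Only the multiples 0, ±a, and ±2a appear, and each is obtained by shifting and negating. This is why 2-bit encoding avoids the carry-propagate addition that the 3X multiple requires in 3-bit encoding.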
To achieve even higher performance advanced hardware multiplier architectures
search for faster and more efficient methods for summing the partial-products. Most
increase performance by eliminating the time consuming carry propagate additions. To
accomplish this, they sum the partial-products in a redundant number representation. The
advantage of a redundant representation is that two numbers, or partial-products, can be
added together without propagating a carry across the entire width of the number. Many
redundant number representations are possible. One commonly used representation is
known as carry-save form. In this redundant representation two bits, known as the carry
and sum, are used to represent each bit position. When two numbers in carry-save form
are added together any carries that result are never propagated more than one bit position.
This makes adding two numbers in carry-save form much faster than adding two normal
binary numbers where a carry may propagate. One common method that has been
developed for summing rows of partial products using a carry-save representation is the
array multiplier.
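The carry-save form described above can be illustrated with a small Verilog sketch. The module csa3to2 reduces three numbers to a sum word and a carry word without propagating any carry along the word, and csa_mult4 uses two such rows to accumulate the four partial products of a 4 x 4 unsigned multiplication. All module and signal names here are assumptions for illustration; this is not the project's code.

```verilog
// Hypothetical carry-save adder (3:2 compressor): each bit position is an
// independent full adder, so no carry ripples across the word.
module csa3to2 #(parameter W = 8) (
    input  [W-1:0] p, q, r,
    output [W-1:0] sum,     // bitwise sum
    output [W-1:0] carry    // saved carries, already moved one position left
);
    wire [W-1:0] maj = (p & q) | (p & r) | (q & r);
    assign sum   = p ^ q ^ r;
    assign carry = maj << 1;
endmodule

// Two rows of carry-save adders followed by one carry-propagate addition.
module csa_mult4 (
    input  [3:0] a, x,
    output [7:0] product
);
    // Each partial product is either 0 or the shifted multiplicand
    wire [7:0] pp0 = x[0] ? {4'b0, a}       : 8'd0;
    wire [7:0] pp1 = x[1] ? {3'b0, a, 1'b0} : 8'd0;
    wire [7:0] pp2 = x[2] ? {2'b0, a, 2'b0} : 8'd0;
    wire [7:0] pp3 = x[3] ? {1'b0, a, 3'b0} : 8'd0;
    wire [7:0] s1, c1, s2, c2;
    csa3to2 #(8) row1 (pp0, pp1, pp2, s1, c1);
    csa3to2 #(8) row2 (s1,  c1,  pp3, s2, c2);
    assign product = s2 + c2;   // the only carry-propagate addition
endmodule

// Exhaustive self-check over all 4-bit operand pairs.
module csa_mult4_tb;
    reg  [3:0] a, x;
    wire [7:0] product;
    integer i, j, errors;
    csa_mult4 dut (a, x, product);
    initial begin
        errors = 0;
        for (i = 0; i < 16; i = i + 1)
            for (j = 0; j < 16; j = j + 1) begin
                a = i; x = j; #1;
                if (product !== i * j) errors = errors + 1;
            end
        $display("mismatches: %0d", errors);
        $finish;
    end
endmodule
```

The delay of each csa3to2 row is one full-adder delay regardless of the word width W. Only the final addition of the sum and carry words involves carry propagation.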
4.2.3 Array Multipliers
Conventional linear array multipliers consist of rows of carry-save adders (CSA).
A portion of an array multiplier with the associated routing can be seen in Figure 4.2.


Figure 4.2: Two Rows of an Array Multiplier

In a linear array multiplier, as the data propagates down through the array, each
row of CSAs adds one additional partial-product to the partial sum. Since the
intermediate partial sum is kept in a redundant, carry-save form there is no carry
propagation. This means that the delay of an array multiplier is only dependent upon the
depth of the array, and is independent of the partial-product width. Linear array
multipliers are also regular, consisting of replicated rows of CSAs. Their high
performance and regular structure have perpetuated the use of array multipliers for VLSI
math co-processors and special purpose DSP chips.
The biggest problem with full linear array multipliers is that they are very large.
As operand sizes increase, linear arrays grow in size at a rate equal to the square of the
operand size. This is because the number of rows in the array is equal to the length of the
multiplier, with the width of each row equal to the width of multiplicand. The large size
of full arrays typically prohibits their use, except for small operand sizes, or on special
purpose math chips where a major portion of the silicon area can be assigned to the
multiplier array.
Another problem with array multipliers is that the hardware is underutilized. As
the sum is propagated down through the array, each row of CSAs computes a result only
once, when the active computation front passes that row. Thus, the hardware is doing
useful work only a very small percentage of the time. This low hardware utilization in
conventional linear array multipliers makes performance gains possible through increased
efficiency. For example, by overlapping calculations, pipelining can achieve a large gain
in throughput. Figure 4.3 shows a full array pipelined after each row of CSAs. Once the
partial sum has passed the first row of CSAs, represented by the shaded row of CSAs in
cycle 1, a subsequent multiply can be started on the next cycle. In cycle 2, the first partial
sum has passed to the second row of CSAs, and the second multiply, represented by the
cross-hatched row of CSAs, has begun. Although pipelining a full array can greatly
increase throughput, both the size and latency are increased due to the additional latches.
While high throughput is desirable, for general-purpose computers size and latency tend
to be more important; thus, fully pipelined linear array multipliers are seldom found.

Figure 4.3: Data Flow through a Pipelined Array Multiplier

4.3 Variations in Multipliers

We do not always synthesize our multipliers from scratch but may desire, or be
required, to use building blocks such as adders, small multipliers, or lookup tables.
Furthermore, limited chip area and/or pin availability may dictate the use of bit-serial
designs. In this chapter, we discuss such variations and also deal with modular
multipliers, the special case of squaring, and multiply-accumulators.

Divide-and-Conquer Designs

Additive Multiply Modules

Bit-Serial Multipliers

Modular Multipliers

The Special Case of Squaring

Combined Multiply-Add Units


4.4 Bit-serial Multipliers

Bit-serial arithmetic is attractive in view of its smaller pin count, reduced wire
length, and lower floor space requirements in VLSI. In fact, the compactness of the
design may allow us to run a bit-serial multiplier at a clock rate high enough to make the
unit almost competitive with much more complex designs with regard to speed. In
addition, in certain application contexts inputs are supplied bit-serially anyway. In such a
case, using a parallel multiplier would be quite wasteful, since the parallelism may not
lead to any speed benefit. Furthermore, in applications that call for a large number of
independent multiplications, multiple bit-serial multipliers may be more cost-effective
than a complex highly pipelined unit.

Figure 4.4: Bit-serial multiplier; 4x4 multiplication in 8 clock cycles

Bit-serial multipliers can be designed as systolic arrays: synchronous arrays of
processing elements that are interconnected by only short, local wires thus allowing very
high clock rates. Let us begin by introducing a semisystolic multiplier, so named because
its design involves broadcasting a single bit of the multiplier x to a number of circuit
elements, thus violating the short, local wires requirement of pure systolic design.
Figure 4.4 shows a semisystolic 4 x 4 multiplier. The multiplicand a is supplied in
parallel from above and the multiplier x is supplied bit-serially from the right, with its
least significant bit arriving first. Each bit x_i of the multiplier is multiplied by a and the
result added to the cumulative partial product, kept in carry-save form in the carry and
sum latches. The carry bit stays in its current position, while the sum bit is passed on to
the neighboring cell on the right. This corresponds to shifting the partial product to the
right before the next addition step (normally the sum bit would stay put and the carry bit
would be shifted to the left). Bits of the result emerge serially from the right as they
become available.
A k-bit unsigned multiplier x must be padded with k zeros to allow the carries to
propagate to the output, yielding the correct 2k-bit product. Thus, the semisystolic
multiplier of Figure 4.4 can perform one k x k unsigned integer multiplication every 2k
clock cycles. If k-bit fractions need to be multiplied, the first k output bits are discarded
or used to properly round the most significant k bits.
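The operation of the semisystolic multiplier described above can be modeled behaviorally in Verilog. The sketch below is an illustration rather than the project's RTL: the names bit_serial_mult, x_bit, p_bit, sum_q, and carry_q are assumptions, and the row of full-adder cells is written as vector-wide Boolean equations instead of separate instances.

```verilog
// Hypothetical behavioral model of a semisystolic bit-serial multiplier.
module bit_serial_mult #(parameter K = 4) (
    input          clk,
    input          rst,     // asynchronous reset, clears the sum and carry latches
    input  [K-1:0] a,       // multiplicand, applied in parallel
    input          x_bit,   // multiplier bit, LSB first, followed by K zeros
    output         p_bit    // product bit, LSB first
);
    reg  [K-1:0] sum_q;     // sum_q[i]: latched sum arriving at cell i from cell i+1
    reg  [K-1:0] carry_q;   // carry latch of each cell
    wire [K-1:0] pp      = a & {K{x_bit}};          // x_bit is broadcast to all cells
    wire [K-1:0] sum_d   = pp ^ sum_q ^ carry_q;    // full-adder sum outputs
    wire [K-1:0] carry_d = (pp & sum_q) | (pp & carry_q) | (sum_q & carry_q);

    assign p_bit = sum_d[0];                        // rightmost cell emits the result

    always @(posedge clk or posedge rst)
        if (rst) begin
            sum_q   <= 0;
            carry_q <= 0;
        end else begin
            sum_q   <= {1'b0, sum_d[K-1:1]};  // sum bits move one cell to the right
            carry_q <= carry_d;               // carry bits stay in place
        end
endmodule

// Testbench: feeds x = 11 serially with a = 13 and collects the 2K product bits.
module bit_serial_mult_tb;
    reg clk = 0, rst = 1, x_bit = 0;
    reg  [3:0] a, x;
    wire p_bit;
    reg  [7:0] product;
    integer j;
    bit_serial_mult #(4) dut (clk, rst, a, x_bit, p_bit);
    always #5 clk = ~clk;
    initial begin
        a = 4'd13; x = 4'd11; product = 8'd0;
        @(negedge clk) rst = 0;                 // reset held across the first rising edge
        for (j = 0; j < 8; j = j + 1) begin
            @(negedge clk);
            x_bit = (j < 4) ? x[j] : 1'b0;      // pad the multiplier with K zeros
            #1 product[j] = p_bit;              // sample before the next rising edge
        end
        $display("%0d * %0d = %0d", a, x, product);
        $finish;
    end
endmodule
```

The testbench applies one multiplier bit per clock and reads one product bit per clock, so a 4 x 4 multiplication occupies the unit for 8 cycles, in line with the 2k-cycle figure given above.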
To make the multiplier of Figure 4.4 fully systolic, we must remove the
broadcasting of the multiplier bits. This can be accomplished by a process known as
systolic retiming, which is briefly explained below.
Consider a synchronous (clocked) circuit, with each line between two functional
parts having an integral number of unit delays (possibly 0). Then, if we cut the circuit into
two parts CL and CR, we can delay (advance) all the signals going in one direction and
advance (delay) the ones going in the opposite direction by the same amount without
affecting the correct functioning or external timing relations of the circuit. Of course, the
primary inputs and outputs to the two parts CL and CR must be correspondingly advanced
or delayed, too.
For the retiming to be possible, all the signals that are advanced by d must have
had original delays of d or more (negative delays are not allowed). Note that all the
signals going into CL have been delayed by d time units. Thus, CL will work as before,
except that everything, including output production, occurs d time units later than before
retiming. Advancing the outputs by d time units will keep the external view of the circuit
unchanged.
We apply the preceding process to the multiplier circuit of Figure 4.4 in three
successive steps corresponding to cuts 1, 2, and 3, each time delaying the left-moving
signal by one unit and advancing the right-moving signal by one unit. Verifying that the
retimed multiplier works correctly is left as an exercise. This new version of our
multiplier does not have the fan-out problem of the design in Figure 4.4, but it suffers
from long signal propagation delay through the four FAs in each clock cycle, leading to
inferior operating speed. Note that the culprits are zero-delay lines that lead to signal
propagation through multiple circuit elements.

One way of avoiding zero-delay lines in our design is to begin by doubling all the
delays in Figure 4.4. This is done by simply replacing each of the sum and carry flip-flops
with two cascaded flip-flops before retiming is applied. Since the circuit is now operating
at half its original speed, the multiplier x must also be applied on alternate clock cycles.
The resulting design is fully systolic, inasmuch as signals move only between adjacent
cells in each clock cycle. However, twice as many cycles are needed.
The easiest way to derive a multiplier with both inputs entering bit-serially is to
allow k clock ticks for the multiplicand bits to be put into place in a shift register and then
use the design of Figure 4.4 to compute the product. This increases the total delay by k
cycles. Figure 4.5 uses dot notation to show the justification for the bit-serial multiplier
design above and depicts the meanings of the various partial operands and results.

Figure 4.5: Bit Serial multiplier design in dot notation


5.1 Tools Used
1) PC with the Linux operating system installed
2) Cadence tools installed:
   ncvlog: checks the code for errors
   ncverilog: executes the code
   SimVision: displays the waveforms

5.2 Coding Steps

1) Create directory structure for the project as below

Figure 5.1: Project directory structure

2) Write the RTL code in a text file and save it with a .v extension in the RTL directory
3) Write the testbench code and store it in the TB directory

5.3 Simulation steps

The commands used in Cadence for execution are as follows:
1) Mount the server using mount -a.
2) Switch to the C shell with the command csh.
3) Source the environment setup file with the command source /root/cshrc.
4) Change to the cadence_digital_labs directory:
#cd .../../cadence_digital_labs/
5) Check the file for errors with the command ncvlog ../rtl/filename.v -mess.
6) Execute the file using ncverilog +access+rwc ../rtl/filename.v ../tb/file_tb.v
+nctimescale+1ns/1ps
Here rwc grants read, write, and connectivity access to the simulation objects; GUI
stands for graphical user interface.
7) After running the program, open the simulation window with the command simvision.


Figure 5.2: Simulation window

8) After the simulation, the waveforms are shown in a separate window.

Figure 5.3: Waveform window

5.4 Full adder code

module fulladder(output reg cout,sum,input a,b,cin,rst);
always@(posedge rst)


5.5 Full adder flowchart

Figure 5.4: Full adder flowchart

5.6 Full adder testbench

module full_adder_tb;
wire cout,sum;
reg a,b,cin,rst;
fulladder fa(cout,sum,a,b,cin,rst);
#2 rst=1'b1;
#(period/2) rst=1'b0;
#5 a=1'b0;


5.7 Bit-serial multiplier algorithm

Figure 5.5: Bit-Serial multiplier flowchart

5.8 Bit-Serial multiplier code

module serial_mult(output product,input [3:0] a,input b,clk,rst);
wire s1,s2,s3;
reg s1o,s2o,s3o; //latches for sum at various stages
wire c0,c1,c2,c3;
reg c0o,c1o,c2o,c3o;//latches for carry at various stages
wire a3o,a2o,a1o,a0o;
reg s;
fulladder fa0(c0,product,a0o,s1o,c0o,rst);
fulladder fa1(c1,s1,a1o,s2o,c1o,rst);
fulladder fa2(c2,s2,a2o,s3o,c2o,rst);
fulladder fa3(c3,s3,a3o,s,c3o,rst);
and n0(a0o,a[0],b);
and n1(a1o,a[1],b);
and n2(a2o,a[2],b);
and n3(a3o,a[3],b);
always@(posedge clk, posedge rst)


else //moving all sums to reg

5.9 Full adder waveform

Figure 5.6: Full adder output waveforms

5.10 Bit-serial multiplier testbench

module serial_mult_tb;
reg [3:0] a;
reg b;
wire product;
reg clk,rst;
parameter period=10;
serial_mult dut(product,a,b,clk,rst); //dut


initial clk=0;
always #period clk=~clk;
#2 rst=1'b1;
#(period/2) rst=1'b0;
@(posedge clk) b=0;
@(posedge clk) b=0;
@(posedge clk) b=1;
@(posedge clk) b=0;
@(posedge clk) b=0;
@(posedge clk) b=0;
@(posedge clk) b=0;
#period $finish;

5.11 Bit-serial multiplier waveforms

Figure 5.7: Bit serial multiplier input/output waveforms

Figure 5.8: Bit serial multiplier with intermediate waveforms


Multipliers play an important role in today's digital signal processing and various
other applications. With advances in technology, many researchers have tried and are
trying to design multipliers that offer one of the following design targets: high
speed, low power consumption, regularity of layout and hence less area, or even a
combination of them in one multiplier, thus making them suitable for various high-speed,
low-power, and compact VLSI implementations. The common multiplication method is the
add-and-shift algorithm. In parallel multipliers, the number of partial products to be added
is the main parameter that determines the performance of the multiplier. To reduce the
number of partial products to be added, the Modified Booth algorithm is one of the most
popular algorithms. To achieve speed improvements, the Wallace Tree algorithm can be used
to reduce the number of sequential adding stages. Further, by combining both the Modified
Booth algorithm and the Wallace Tree technique, we can see the advantages of both
algorithms in one multiplier. However, with increasing parallelism, the amount of shifts
between the partial products and intermediate sums to be added will increase, which may
result in reduced speed, an increase in silicon area due to irregularity of structure, and
also increased power consumption due to the increase in interconnect resulting from
complex routing. On the other hand, serial-parallel multipliers compromise speed to
achieve better performance for area and power consumption. The selection of a parallel or
serial multiplier actually depends on the nature of the application.
A key challenge facing current and future computer designers is to reverse the
trend by removing layer after layer of complexity, opting instead for clean, robust, and
easily certifiable designs, while continuing to try to devise novel methods for gaining
performance and ease-of-use benefits from simpler circuits that can be readily adapted to
application requirements.
This is achieved by using Bit Serial multipliers.

