
LOOK-AHEAD INSTRUCTION SCHEDULING FOR DYNAMIC
EXECUTION IN PIPELINED COMPUTERS

A Thesis Presented to
The Faculty of the College of Engineering and Technology
Ohio University

In Partial Fulfillment
of the Requirements for the Degree
Master of Science

by
Vijay K. Reddy Anam

June, 1990

TABLE OF CONTENTS

CHAPTER I    Introduction
CHAPTER II   Design of Look-Ahead Pipelined Computer System
    2.1  Introduction
    2.2  Dynamic Instruction Scheduling
    2.3  Reducing Branch Penalty
    2.4  Hardware System
CHAPTER III  Design of Dynamic Pipelined Arithmetic Unit
    3.1  Introduction
    3.2  Principle of Operation of the CSA Tree
    3.3  Conversion of Unifunction Pipeline to Multifunction Pipeline
    3.4  Dynamic Execution of Instructions
CHAPTER IV   Instruction Execution in the Pipeline System
CHAPTER V    Computer Simulation and Experimental Results
    5.1  Functions Emulating the Stages of the PIU
    5.2  Functions Emulating the Stages of the PEU
    5.3  Control of the Pipeline
    5.4  Computer Generation of the State Diagrams
    5.5  Experimental Results
CHAPTER VI   Conclusions and Discussions
REFERENCES
APPENDIX
    A.  State Matrices
    B.  Computer Program to Generate State Matrices
    C.  Simulation Program

CHAPTER ONE
INTRODUCTION

Advances in computer technology are leading to the advent of high speed computers which are cost effective and faster than their predecessors. Mainframe machines like the Texas Instruments TI-ASC, IBM System/360 Models 91 and 195, Burroughs PEPE, CRAY-1, CDC STAR-100, CDC 6600 and CDC 7600 have, to a large extent, pipeline processing capabilities in their instruction and arithmetic units or in the form of pipelined special purpose functional units [1-4].
Pipelining is a way of embedding parallelism in a system. The principle of pipelining is to partition a process into several subprocesses and execute these subprocesses concurrently in dedicated individual units. This is analogous to the operation of an assembly line in the automotive industry. In a non-pipelined computer system, the execution of an instruction involves the following processes: 1) fetching the instruction, 2) decoding the instruction, 3) fetching the operands, and 4) executing the instruction. In a pipelined system, instruction execution can be split into four subprocesses which are performed by dedicated units functioning concurrently. The advantage of this operation is that while a unit is operating on an instruction, the immediately preceding unit can be operating on the next instruction, and so on. Thus the throughput of a pipelined system is much higher than that of a non-pipelined system. The overlapped execution is depicted in a space time diagram in Fig. 1.1.
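The overlap described above can be visualized with a short sketch. The fragment below is illustrative only, not part of the thesis simulator; it assumes a four-stage pipeline (fetch, decode, operand fetch, execute) with one cycle per stage and prints a space-time diagram of the kind shown in Fig. 1.1:

```python
# Space-time diagram for an ideal 4-stage pipeline.  Instruction i
# enters the pipe at cycle i, so k instructions finish in k + 3 cycles
# instead of the 4*k cycles a non-pipelined machine would need.
STAGES = ["F", "D", "O", "E"]   # fetch, decode, operand fetch, execute

def space_time(n_instructions):
    n_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        # blanks before the instruction enters, one stage per cycle after
        row = ["."] * n_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name
        rows.append(f"I{i + 1}  " + " ".join(row))
    return rows

if __name__ == "__main__":
    for line in space_time(4):
        print(line)
```

For four instructions this prints a staircase of F D O E rows spanning seven cycles, versus the sixteen cycles the same work would take without overlap.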
The second generation and earlier computers employed arithmetic and logic units which were unsophisticated and under-utilized. The introduction of pipeline techniques in processor design necessitated the advent of new algorithms to control the instruction flow and resolve any hazards that might arise in the execution of instructions. Several look-ahead algorithms have been proposed with the capability of executing more than one instruction at the same time. These algorithms were successfully employed in many third generation computers involving multiple execution units.

The look-ahead algorithms were designed at the processor level and involved the following common tasks: 1) detecting the instructions that can be executed concurrently, 2) issuing the instructions to the functional units, and 3) assigning the registers to various operands. The ideal throughput is difficult to achieve due to dependencies within the instructions of a program. The data dependencies have to be resolved either by scheduling the execution of the instruction or by placing the instruction into a buffer and monitoring the registers to resolve the instructional dependencies. Tomasulo [5] has

Fig. 1.1  Ideal operation of a pipelined computer system: a) the structure of a general pipeline computer; b) the ideal flow of instructions in time space

proposed an algorithm to resolve the dependency situation by creating a reservation station (RS) to hold instructions that are awaiting execution. Instructions remain in the RS until the operand conflicts are resolved. The RS monitors a common data bus and captures the operands for the instructions as they become available. The instruction identifies its operands by an address tag scheme. Each source register is assigned a ready bit which determines the usage of the register. A register is set busy if it is the destination of an instruction in execution. If a source register is busy when an instruction reaches the issue stage, the tag of the register is obtained and attached to the instruction, and the instruction is forwarded to the RS. If a sink register is busy, then a new tag is attached to the instruction against the sink register and the tag is updated on that register. This system is expensive to implement: each register has to be tagged, and each tag needs associative comparison hardware to carry out the tag matching process. The problem is compounded if the number of registers is large. Sohi and Vajapeyam [6] have modified and extended Tomasulo's algorithm for the CRAY-1 system. The modifications were made to reduce the hardware needed for tagging a large bank of registers. The tags are all consolidated into a tag unit (TU). The tags are issued to registers from the TU and are returned to the common pool as soon as a tag is released. The reservation stations are combined into a common RS pool, and instructions are issued to the various functional units as they become ready. This scheme relies on the tag comparing hardware for proper execution and still requires a large number of register tags for all its registers. In both algorithms [5] and [6], associative tags are compared while forwarding a single instruction. If the instructions awaiting execution are large in number, the process of associative comparison is time consuming and cannot be avoided. Keller [7] proved that optimal resolution of dependencies could be achieved by a control scheme that employs first-in first-out (FIFO) queues. Unlike the previous algorithms, these queues eliminate the associative search process. A queue is associated with each pair of conflicting operations. An operation belongs to a queue if it is one of the operations associated with that queue. The elements stored in the queue are represented by tokens. Each operation involves a distinct token. When an instruction enters the issue stage, it places a token at the tail of each queue corresponding to the operation. Before an operation begins, there must be a corresponding token at the head of each queue to which the operation belongs. When the operation is completed, the tokens are removed. Each queue is implemented as a linked list. The disadvantage of this scheme is that if there are m different binary functions and n different registers, the number of queues would be (m*n)^4. Dennis [8] proposed a similar queuing scheme with substantially fewer queues. These queues are not FIFO in nature; each queue corresponds to a single register. Token interchanging can occur in a nondeterministic fashion, which casts doubt on the efficiency of such an implementation. Tjaden and Flynn [9] have proposed a scheme wherein a block of M instructions can be executed simultaneously. The scheme analyzes the dependencies of a block of instructions and issues a set of independent instructions for execution. This scheme has two constraints: 1) it cannot handle indirect addressing, and 2) the source operands, the sink result, and the next instruction must be specified by defining their locations in storage.

Ramamoorthy and Kim [10] have proposed a scheme called the dynamic sequencing and segmentation model (DSSM) for efficient sequencing of instructions with very low overheads. The overheads are reduced by overlapping the unproductive administrative and bookkeeping computations with the execution of computational tasks. The end result is the efficient exploitation of parallelism. Smith and Weiss [11] have proposed a modified scheme of Thornton's algorithm [12] for the Cray-1 system. In this algorithm, dynamic scheduling is adopted and the associative tag comparisons are eliminated.

The effectiveness of the above mentioned schemes is dependent on the availability of functional units. This problem is alleviated by providing replicated functional units, as provided in the TI-ASC computer [13], and reconfiguring the units as needed. The general approach is to provide a static functional unit for each class of operations. Static functional units can execute instructions only when the operation defined by the instruction falls within the class for which the unit was designed. The Astronautics ZS-1 [14] operates on a decoupled architecture and supports two instruction streams. This machine is capable of forwarding two instructions to the execution units within a clock period. Dependent instructions are held at the issue stage until the dependency is resolved. The two streams are unequal in length and are supported by multiple static execution units. Data can be copied between the two units via a copy unit. Queues are used for memory operands, providing a flexible way of relating the memory access functions and floating point operations. This provides a dynamic allocation of memory access functions ahead of the floating point operations. There is no reordering of instructions within a pipeline.
In this research a system is developed which executes instructions dynamically. The hardware is a pipelined system consisting of two fundamental sub-systems: the pipelined instruction unit (PIU) and the pipelined execution unit (PEU). The PIU can further be divided into the fetch unit (FU), the decode unit (DU), and the issue unit (IU). The PEU is also divided into the dynamic arithmetic unit (DAU) and the logic unit (LU).

Fig. 1.2  Proposed pipeline system shown with the sub-units

The overall system configuration is illustrated in Fig. 1.2. The operation of the system assumes


no shuffling of instructions by any compiler. The hardware supports two instruction streams, which are necessary for executing branch instructions. The DAU can execute three different arithmetic operations independently within the same pipeline cycle. This improves the performance over a similar static unit capable of executing a single operation at a time. A simple tagging system is used to resolve the dependency within instructions. No associative comparisons are necessary in this algorithm. The instructions are held in delay stations (DS) present in the stages of the execution units. An instruction is held in a stage only if it needs a missing operand to enter the next stage. The data is fed to the DS via a common data bus (CDB).
The remainder of this thesis is organized in six chapters. Chapter II introduces the system and explains the function of each sub-system along with the scheduling of instructions. Chapter III describes the operation and the design of the DAU. It also includes the generation of state diagrams to predict the latencies and to schedule the execution of instructions in the DAU. Chapter IV explains the operation of the proposed system. Chapter V deals with the computer simulation of the system and the experimental results. Chapter VI includes discussion and conclusions.

CHAPTER TWO

DESIGN OF THE LOOK-AHEAD PIPELINED COMPUTER SYSTEM

2.1  INTRODUCTION:

As stated in Chapter 1, sequential computers are not efficient in utilizing their resources. The serial design principles do not allow any independence to the functional units present in the central processing unit (CPU). The instructions are executed serially, one at a time. There is no overlap between two successive instructions in the execution phase. This leads to many of the functional units being idle most of the time. The new generation complex instruction set computers (CISC), such as the Intel 80286, 80386 and 80486 and the Motorola 68020 and 68030, have incorporated pipelining techniques at the fetch level. The general pipeline system consists of stages devoted to fetch, decode, issue and execute. These stages operate concurrently. Elements are provided between the stages to synchronize the flow of data from one stage to another. This could also be achieved by incorporating these elements as a part of each stage. At the beginning of every pipeline cycle, each stage receives data from the previous stage. The data is processed and the result is forwarded to the next stage at the end of the cycle. During the cycle, the output of a stage will contain the result obtained from processing the data of the previous cycle. It will change to the current result only at the end of the current cycle. This is necessary to prevent the result of one stage preemptively influencing the operations of the next stage. The process is shown in a time space diagram in Fig. 2.1.
The pipelined system proposed in this research is an instruction look-ahead system which consists of four fundamental units. The system is illustrated in Fig. 2.2. The first three units comprise the pipelined instruction unit (PIU) and the last unit is the pipelined execution unit (PEU). The PIU consists of the following units: the fetch unit, the decode unit, and the issue unit. The execution unit is made up of the pipelined arithmetic unit (PAU) and the logic unit (LU). The arithmetic unit is subdivided into the dynamic fixed point arithmetic unit (DAU) and the dynamic floating point arithmetic unit (FPAU). The pipelined arithmetic units consist of seven stages and can perform the operations of addition, subtraction, multiplication and division. The dynamic nature of the arithmetic unit is exploited by the system to initiate more than one instruction in a single pipeline cycle. The individual operations take different amounts of time to execute. The table in Fig. 2.3 lists the execution times of the various arithmetic and logic operations. The design of the PAU is described in more detail in Chapter 3. The design of the LU and FPAU is left for further research.

Fig. 2.1  Time space diagram of instruction flow in a pipeline system (pipeline cycles 0 through 2, showing instructions 1-3 advancing through the fetch, decode and issue units via latches 1-3)

Fig. 2.2  The pipeline system with the various units (fetch unit, decode unit, issue unit, logic unit, fixed point arithmetic unit, floating point arithmetic unit)

    Instruction        Instruction Type
    Add / Subtract     Arithmetic
    Multiplication     Arithmetic
    Division           Arithmetic
    Store / Load       Logic
    And / Or / Not     Logic

Table 2.1 (Fig. 2.3)  Instructions and their execution times

The performance of a pipeline is dependent on the order of the instructions in the instruction stream. If consecutive instructions have data and control dependencies and contend for the same resources, then hazards will develop in the pipeline system and the performance will suffer. To improve performance, it is often possible to schedule the instructions so that the dependencies and resource conflicts are resolved. There are two different ways that instruction scheduling can be carried out. First, it can be done at compile time by the compiler or the linker. This is referred to as static scheduling because it does not change as the program is being executed. Second, it can be done by hardware at execution time. This is referred to as dynamic scheduling. Most compilers for pipelined processors do some sort of static scheduling. Static scheduling does not have any information about the run-time dependencies, and hence the optimization is highly relative to the type of program being compiled. Dynamic scheduling, on the other hand, is independent of the compiled instruction code and can take advantage of the dependency information at the time of issue. This dependency information is not available at compile time. In this research a dynamic instruction scheduling algorithm is proposed based on the execution time periods of instructions. The rest of this chapter is organized in three main sections: 1) dynamic instruction scheduling, 2) reducing branch overheads, and 3) the hardware system.
2.2  DYNAMIC INSTRUCTION SCHEDULING:

The main objective of the scheduling algorithm is to overcome the four main hazards: 1) read after write (RAW), 2) write after write (WAW), 3) write after read (WAR), and 4) operational hazard. Their significance is worth a more elaborate explanation. The registers and memory are known as resources. A RAW hazard occurs when an instruction tries to read a resource that has not completed its last write process. A WAW hazard occurs if an instruction attempts to write into a resource that has yet to complete its previous write operation. A WAR hazard occurs when an instruction tries to write into a resource which has not completed its previous read operation.
Consider the following instructions:

    load  r3, (A);
    .....
    load  r2, (B);
    .....
    add   r1, r2, r3;
    store (X), r1;
    load  r1, (C);
    .....

A potential RAW hazard can occur if the add instruction is executed before the load instructions can update either r3 or r2. The add instruction may receive a value that is outdated if executed. The hazard is illustrated in Fig. 2.4. A WAR hazard can occur if the third load instruction is overlapped with the add instruction. In this case the resource (X) will be loaded with the result of the third load instruction before the store instruction can access r1. In simpler terms, the third load instruction will reinitialize r1 soon after the add instruction has initialized it. These events would take place before the store instruction accesses r1. The hazard is illustrated in Fig. 2.5. A WAW hazard occurs when the third load instruction updates r1 before the add instruction does. This is shown in Fig. 2.6. The operational hazard takes place if more than one instruction attempts to use the facilities of a particular stage during the same pipeline cycle. The common form of this hazard is that two instructions are scheduled to start execution at the same time from the same stage. This hazard can be eliminated by using the state matrix of the functional pipeline unit to schedule the execution of instructions, during initiation, into the arithmetic unit.
2.2.1  RESOLVING THE HAZARDS:

The number of pipeline cycles that an instruction needs to complete execution is fixed by the design of the execution unit. This information is used as the basis for scheduling the instructions. The instruction scheduling is carried out by the issue unit and the execution unit. The issue unit schedules the instruction to eliminate the RAW, WAW and WAR hazards. The execution unit schedules the instruction to eliminate the operational hazard.

Consider the set of instructions listed below:

    load  r1, (X);       r1 <-- (X)
    load  r2, (Y);       r2 <-- (Y)
    mult  r3, r1, r2;    r3 <-- r1 * r2
    store (Z), r3;       (Z) <-- r3
    add   r3, r1, r2;    r3 <-- r1 + r2
    store (U), r3;       (U) <-- r3
    load  r4, (B);       r4 <-- (B)
    load  r5, (D);       r5 <-- (D)
    mult  r3, r4, r5;    r3 <-- r4 * r5
    store (V), r3;       (V) <-- r3

Fig. 2.7 illustrates the ideal flow for the above set of instructions. The domain D(I) of an instruction is defined as the set of resource objects that may affect the instruction I. The range R(I) of an instruction is defined as the set of resource objects that are modified by the instruction I. A RAW hazard between instructions I and J will be present if the intersection between R(I) and D(J) is not a null set. A WAW hazard will occur if the intersection between R(I) and R(J) is not a null set. A WAR hazard will occur if the intersection between D(I) and R(J) is not a null set. Tabulating the conditions below:

    R(I) ∩ D(J) ≠ ∅    for RAW    (2.1)
    R(I) ∩ R(J) ≠ ∅    for WAW    (2.2)
    D(I) ∩ R(J) ≠ ∅    for WAR    (2.3)
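Conditions 2.1 through 2.3 are plain set intersections, so they can be checked mechanically. The sketch below is illustrative only; the dictionary encoding of an instruction is an assumption for the example, not the thesis format:

```python
# Hazard classification between an earlier instruction I and a later
# instruction J, following conditions 2.1-2.3: a hazard exists when
# the relevant intersection of range/domain sets is non-empty.
def domain(inst):
    """D(I): resource objects that may affect the instruction (sources)."""
    return set(inst["sources"])

def rng(inst):
    """R(I): resource objects modified by the instruction (sinks)."""
    return set(inst["sinks"])

def hazards(i, j):
    found = []
    if rng(i) & domain(j):
        found.append("RAW")    # J reads what I writes    (2.1)
    if rng(i) & rng(j):
        found.append("WAW")    # J rewrites what I writes (2.2)
    if domain(i) & rng(j):
        found.append("WAR")    # J writes what I reads    (2.3)
    return found

load_r3 = {"op": "load", "sources": {"(A)"}, "sinks": {"r3"}}
add_r1 = {"op": "add", "sources": {"r2", "r3"}, "sinks": {"r1"}}
load_r1 = {"op": "load", "sources": {"(C)"}, "sinks": {"r1"}}

print(hazards(load_r3, add_r1))   # ['RAW'] -- add reads r3
print(hazards(add_r1, load_r1))   # ['WAW'] -- both write r1
```

The two printed cases reproduce the RAW and WAW situations of Figs. 2.4 and 2.6 for the load/add/store example above.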

The hazards that arise when the instruction flow is ideal are shown in Fig. 2.8. A hazard free flow is illustrated in Fig. 2.9. This flow is achieved by scheduling the execution of the instructions. A time window is provided for each instruction to start and complete its execution.

    load R3, (A);
    load R2, (B);
    add  R1, R2, R3;

The ADD instruction is issued for execution while the previous two instructions are in execution.

Fig. 2.4  Occurrence of RAW hazard

    add   R1, R2, R3;
    store (X), R1;
    load  R1, (C);

The STORE instruction is issued after the LOAD instruction has completed execution.

Fig. 2.5  Occurrence of WAR hazard

    add   R1, R2, R3;
    store (X), R1;
    load  R1, (C);

The LOAD instruction completes execution before the ADD instruction.

Fig. 2.6  Occurrence of WAW hazard

Fig. 2.8  The various hazards in the ideal instructional flow: RAW hazards between each mult or add instruction and the two load instructions preceding it, RAW hazards between each store instruction and the arithmetic instruction producing its operand, and a WAW hazard between the add and mult instructions that both write r3

Fig. 2.10  Allotting the time window for hazard free execution of instructions

Fig. 2.9 is modified to show the time window in Fig. 2.10. The execution time is fixed for an instruction. Considering Fig. 2.9, the RAW hazard between instruction 1 and instruction 3 will be resolved as soon as instruction 1 completes execution. The same argument can be applied between instruction 2 and instruction 3. Basically, the condition listed in equation 2.1 must be false. To schedule instruction 3 it is necessary to know the time at which instructions 1 and 2 would complete execution. This is illustrated in Fig. 2.11. Instruction 4 depends on instruction 3, which in turn depends on instructions 1 and 2. Fig. 2.12 illustrates the resolving of the RAW hazard from instruction 1 to instruction 4. It is possible to generalize that when an instruction I initializes a resource R and an instruction I+k (k>0) reinitializes the same resource, then all the instructions between I and I+k will be dependent on instruction I for the resource R. This dependency will last as long as the instruction I is in the process of execution. The concept is illustrated in Fig. 2.13. Thus the time when instruction I would complete execution is important to schedule the dependent instructions. In our example, the times when R1, R2 and R3 would be initialized determine the exact time window for execution of instructions 3 and 4. Instruction 3 is delayed for execution until R1 and R2 have been initialized. Similarly, instruction 4 is delayed until R3 is updated.

Fig. 2.11  The various events (issue cycle, start of execution, end of execution) in the pipelined execution of instructions

Fig. 2.12  Resolving of RAW hazard from instruction 1 to instruction 4

The instructions between the two horizontal lines are dependent on the first multiplication instruction. These instructions will have to be scheduled depending on the availability of the result of the multiplication instruction.

Fig. 2.13  The highlighted instructions are dependent on register R3

The times when the four instructions would complete execution are also shown in Fig. 2.13. If no scheduling is carried out, instruction 3 will be issued during the fifth pipeline cycle. At this time, instruction 1 would require two cycles and instruction 2 would require three cycles to complete execution. To resolve the resource conflicts, pointers are associated with each resource and are used to monitor the write processes of each resource. Let pointers C1 to CN represent pointers that are associated in a one-to-one correspondence with the registers R1 to RN. Each time an instruction is issued for execution, the pointer corresponding to the sink resource is loaded with the time that the result would be placed into the resource. If vp represents the value of the pointer, then vp is numerically equal to the difference between the time of issue and the time that the sink resource (associated with this pointer) is updated with the result. For example, if the instruction is issued for execution during the fourth cycle and the result of the instruction will be available in the sink register during the seventeenth cycle, then vp would be equal to thirteen as shown below:

    vp = 17 - 4 = 13 pipeline cycles.

During the cycles that follow, the contents of the pointer are decremented by a single step in each cycle. This is due to the fact that the instruction is one step closer to completion of execution with each passing pipeline cycle.
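The countdown behaviour of the pointers can be modelled directly. The following sketch is illustrative only; the class name, register-file size and latency values are assumptions for the example, not part of the thesis design:

```python
# Per-register countdown pointers: when an instruction issues, the
# pointer of its sink register is loaded with the number of cycles
# until the result is written; every cycle, all non-zero pointers are
# decremented.  An instruction may issue only when the pointers of
# its source registers have reached zero (operands ready).
LATENCY = {"load": 6, "add": 6, "mult": 8}   # assumed cycle counts

class PointerFile:
    def __init__(self, n_regs):
        self.c = {f"r{i}": 0 for i in range(1, n_regs + 1)}

    def tick(self):
        for r in self.c:
            if self.c[r] > 0:
                self.c[r] -= 1

    def ready(self, sources):
        return all(self.c[r] == 0 for r in sources)

    def issue(self, op, sink):
        self.c[sink] = LATENCY[op]

pf = PointerFile(5)
pf.issue("load", "r1")          # cycle 1: load r1,(X) -> c1 = 6
pf.tick()
pf.issue("load", "r2")          # cycle 2: load r2,(Y) -> c2 = 6, c1 = 5
print(pf.ready({"r1", "r2"}))   # False: mult r3,r1,r2 must wait
```

Only when both counters have counted down to zero may the dependent multiplication be issued, which is exactly the delaying of instruction 3 described above.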

Thus when vp reaches zero, the result will be available in the sink register associated with the pointer. Initializing the pointers at the time of issue and monitoring them gives the information as to when the write process to the resource (associated with the pointer) is completed. Hence the instructions that are dependent on this resource will have to be delayed until vp decrements to zero. Considering the instruction set, the value present in C1 will be equal to 2 and C2 will be equal to 3 at the time instruction 3 is issued. Fig. 2.14 illustrates the pointer values during the ideal instruction flow. Fig. 2.15 illustrates how Fig. 2.14 is modified to obtain Fig. 2.13. Using the pointers to denote the time when each resource would complete its most recent update, the algorithm is developed as follows.
Let the instruction stream be represented by a set of instructions IS = { I1, I2, I3, I4, ..., In }, where n is the maximum number of instructions in the window in memory at any pipeline cycle. Let the registers in the system be represented as a register set R = { r1, r2, r3, ..., rN }, where N represents the total number of registers in the system. Let C = { c1, c2, c3, ..., cN } represent a set of counters (pointers) that are assigned to the register set. There is a one-to-one correspondence between the counters and the registers. A counter is assigned to a single register and vice versa. For simplicity, we assume that a counter denoted

Fig. 2.14  Pointer values (C1, C2, C3) associated with the sink registers

by subscript j is assigned to the register with subscript j. Each counter carries the information about the number of pipeline cycles that are needed for the assigned register to assume its new value when initialized by the most recent instruction. The algorithm is based on the following:

1) The instruction order has to be maintained.

2) The value carried by the counter which is assigned to a specific register can change only when the register is used as the sink register by the instruction that is being scheduled.

3) The maximum number of source registers that can be specified is two, and the number of sink registers is one.

4) The registers and the counters are all initialized to zero at the start of operations. The first instruction is executed assuming zero possibility of either hazards or collision.

Let each instruction be represented by Ik = { OC, ra, rb, rc, ca, cb, cc }, where OC is the op-code of the instruction, ra, rb and rc are the registers used by the instruction, and ca, cb and cc are the counters that are designated to the registers ra, rb and rc respectively. To schedule an instruction, there are three possibilities that need to be investigated: 1) all counter elements associated with the source registers are zero, 2) only one counter element associated with a single source register is zero, and 3) none of the counters assigned to the source registers are zero. Equations for the instruction scheduling are developed as the analysis is carried out for each case. The equations are then summarized.
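The three-way classification at issue time can be written down compactly. This sketch is illustrative only (the function name and encoding are not from the thesis); it merely distinguishes the three cases by inspecting the counters ca and cb of the two source registers:

```python
# Classify an instruction Ik = { OC, ra, rb, rc, ca, cb, cc } at issue
# time by the counters ca, cb of its two source registers ra and rb.
def issue_case(ca, cb):
    if ca == 0 and cb == 0:
        return 1    # case 1: both operands ready, no RAW hazard
    if ca == 0 or cb == 0:
        return 2    # case 2: exactly one source still being written
    return 3        # case 3: both sources still being written

print(issue_case(0, 0), issue_case(0, 3), issue_case(2, 3))  # 1 2 3
```

Each case then gets its own delay equations, developed in the analysis that follows.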
CASE 1:

In this case no RAW hazard is involved. The data operands are currently available in their respective source registers and the instruction can be issued to the execution unit without assigning any delay. This instruction will place its result in the sink register after execution. The result will be assigned to the destination register after T pipeline cycles, which is given by

    T = Te + Ts                                    (2.4)

where Te is the time required for execution by the execution unit and is fixed by the system design, and Ts is the system delay that is fixed by the overheads in the system. Hence the result will be placed in the sink register T pipeline cycles after the present cycle. Consider the set of instructions given below:
    1. load  r1, (X);      r1 <-- (X)
    2. load  r2, (Y);      r2 <-- (Y)
       ..........
    8. mult  r3, r1, r2;   r3 <-- r1 * r2
    9. store (Z), r3;      (Z) <-- r3

Let the first instruction be the load instruction. This is issued to the execution unit and c1 is initialized to 6. It takes six pipeline cycles for r1 to read (X). Similarly, in the second pipeline cycle c2 is loaded with 6. The contents of c1 during the second pipeline cycle will be 5.


If the multiplication instruction is executed after the eighth cycle, the data is readily available and the instruction can be issued without assigning any delay. In the above example Te is equal to 4 pipeline cycles and Ts is equal to 2 pipeline cycles, resulting in T being 6 pipeline cycles. The multiplication instruction is issued at the ninth pipeline cycle. Ttest is used to check against the WAW hazard and is numerically equal to the sum of T and any other delay:

    Ttest = T + Tadditional-delay                  (2.5)

In this case, where no RAW hazard occurs, the additional delay term is 0.

The WAW hazard is a possibility if the c , ~ ~ is


~ (not
~ ~ ~ )
zero. The subscript

llsink(old)llrefers to the current

value of the counter associated with the sink register (the one
used by the present instruction) which has not yet been
updated by the issue unit. This implies that a previously
initiated instruction I using the same sink register has not
yet been updated. If the present instruction is denoted as
instruction J, then R(I) is not equal to R(J). The Tsink-delay
is the delay assigned to the instruction by the issue
unit to resolve the WAW dependencies. The calculation of the


delay depends on two cases: A) the value of the counter
csink(old) is greater than Ttest, and B) the value of the
counter csink(old) is less than or equal to Ttest.
CASE A:

csink(old) greater than Ttest implies that a WAW hazard
will occur. The instruction has to be delayed until
csink(old) is less than Ttest. The difference between
csink(old) and Ttest can be set by the system or be a fixed
value. In this research the value is fixed and is equal to
two pipeline cycles. The Tsink-delay is calculated as follows:

    Tsink-delay = csink(old) - (Te - 1)                    (2.6)

Let Tinst-delay represent the total time delay assigned to the
instruction to resolve the RAW and WAW hazards. The Tinst-delay
is numerically equal to the Tsink-delay in the absence of the
RAW hazard.

    Tinst-delay = Tsink-delay                              (2.7)

The new value of csink(new) can be set according to the
following equations:

    csink(new) = Te + Ts + (Tinst-delay - 1)               (2.8)
               = Te + Ts + csink(old) - (Te - 1) - 1       (2.9)
               = Ts + csink(old)                           (2.10)
               = csink(old) + 2                            (2.11)

In equation 2.8 the term "(Tinst-delay - 1)" is used
because of the overlap of the delay value becoming zero and
the beginning of execution for the instruction. If a delay
of 10 cycles is assigned to the instruction, then the
instruction will start execution when the delay decrements
to 0. Thus the time that the result will be loaded into the
register will be (9 + execution time) rather than (10 +
execution time). For example, to execute the multiplication


instruction, the contents of c3 must be evaluated. If c3 is
non-zero then there is a possibility of a WAW hazard. Let the
contents of c3 be 12 at the time the multiplication
instruction is being issued. This implies that the previous
instruction that has used r3 as its sink has not completed
execution and there will be an additional 12 cycles
before the previous instruction will update r3. The Te for
the mult instruction is 6 pipeline cycles. From the equation,
Tinst-delay is computed to be 7 pipeline cycles. It is
evident that if the instruction is not delayed, the present
multiplication instruction will initialize the register r3
with the wrong value. This is not acceptable as it gives
rise to a WAW hazard. The multiplication instruction should
be executed after 7 pipeline cycles. The result of the
present instruction will be loaded into r3 after 14 pipeline
cycles from the current cycle. Hence c3 is initialized to 14
before the instruction is issued. The new value of c3 is
used to determine WAW hazards with the instructions
logically following the multiplication instruction.
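The Case A arithmetic can be sketched in Python. This is a minimal sketch; the function name and the separation of Te and Ts into parameters are illustrative assumptions, but the numbers reproduce the worked example above (c3 = 12 at issue, Te = 6, Ts = 2, giving a 7-cycle delay and c3 reinitialized to 14):

```python
def case_a_delays(c_sink_old, t_e, t_s):
    """WAW (Case A) delay assignment: the instruction waits until
    the older write to the sink register is about to complete."""
    t = t_e + t_s                           # total time T = Te + Ts
    t_inst_delay = c_sink_old - (t_e - 1)   # delay assigned to the instruction
    # the -1 accounts for the overlap of the delay reaching zero
    # and the instruction beginning execution
    c_sink_new = t + (t_inst_delay - 1)
    return t_inst_delay, c_sink_new

# Worked example: c3 = 12 at issue, Te = 6, Ts = 2
print(case_a_delays(12, 6, 2))   # -> (7, 14)
```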


CASE B:

The counter element assigned to the sink register
is less than or equal to Ttest. The possibility of a WAW hazard
exists and this warrants that a delay be introduced by the
system. The delay assigned is at most two pipeline cycles. The
calculation of Tsink-delay differs from the previous case.
    Tsink-delay = 2   if Ttest - csink(old) = 0            (2.12)
    Tsink-delay = 1   if Ttest - csink(old) = 1            (2.13)
    Tsink-delay = 0   if Ttest - csink(old) >= 2           (2.14)

The new value of csink(new) is calculated as follows:

    csink(new) = T + Tinst-delay = T + 2   for equation (2.12)
    csink(new) = T + Tinst-delay = T + 1   for equation (2.13)
    csink(new) = T;  Tinst-delay = 0       for equation (2.14)


CASE 2 & 3:

In this case the counters associated with
one or both of the source registers are non-zero. The RAW hazard is
a definite possibility and has to be resolved. The
instruction must necessarily be delayed until the source
dependencies are resolved. Another delay term Tsrc-delay is
introduced in the total delay equation. Tsrc-delay is the
additional delay element in the calculation of Ttest. This
delay term is equal to the non-zero counter value associated
with the source register in case 2 and is equal to the value
computed by equation 2.20 in case 3. Both cases cannot
exist simultaneously. The Tsrc-delay is necessary as the
execution of an instruction will have to be delayed until
the RAW hazards are resolved. The total test time is now
equal to:

    Ttest = T + Tsrc-delay                                 (2.18)

where, in case 2:

    Tsrc-delay = csource-reg + 1                           (2.19)

and in case 3:

    Tsrc-delay = max(csource-reg1, csource-reg2) + 1       (2.20)

The WAW hazard is checked in the same manner as in case 1.
The difference from case 1 is that Tsrc-delay
has to be taken into consideration in deciding Tinst-delay. The
Tinst-delay is calculated similarly to case 1.
If csink(old) > Ttest:

    Tinst-delay = (Tsrc-delay + csink(old)) - (Te - 1)     (2.21)

    csink(new) = T + (Tinst-delay - 1)                     (2.22)
               = Te + Ts + Tsrc-delay + csink(old)
                 - (Te - 1) - 1                            (2.23)
               = Ts + Tsrc-delay + csink(old)              (2.24)
               = csink(old) + Tsrc-delay + 2               (2.25)

If csink(old) <= Ttest, the delays are calculated as follows:

    Tinst-delay = Tsrc-delay + 2   if Ttest - csink(old) = 0    (2.26)
    Tinst-delay = Tsrc-delay + 1   if Ttest - csink(old) = 1    (2.27)
    Tinst-delay = Tsrc-delay       if Ttest - csink(old) >= 2   (2.28)

The new value of csink(new) is calculated as follows:

    csink(new) = T + Tinst-delay = T + Tsrc-delay + 2   for equation (2.26)
    csink(new) = T + Tinst-delay = T + Tsrc-delay + 1   for equation (2.27)
    csink(new) = T + Tinst-delay = T + Tsrc-delay       for equation (2.28)

The equations to resolve the dependencies are summarized
below.

In the absence of RAW and WAW hazards, the expressions for
Ttest and csink(new) are as follows:

    Ttest = T
    csink(new) = T

The values of csource-reg1, csource-reg2 and csink(old) are
all zero.

In the absence of RAW hazards, Ttest and csink(new) are
shown below:

    Ttest = T                                              (2.32)
    Tinst-delay = csink(old) - (Te - 1)
                  if csink(old) > Ttest                    (2.33)
    Tinst-delay is equal to zero, one or two
                  if csink(old) <= Ttest                   (2.34)
    csink(new) = T + (Tinst-delay - 1)   if Tinst-delay > 0
    csource-reg1 = csource-reg2 = 0

The equations for determining the delays to resolve RAW
and WAW hazards are summarized below:

    Ttest = T + Tsrc-delay                                 (2.35)
    Tinst-delay = (Tsrc-delay + csink(old)) - (Te - 1)
                  if csink(old) > Ttest                    (2.36)
    Tinst-delay is Tsrc-delay added with zero, one or two
                  if csink(old) <= Ttest                   (2.37)
    csink(new) = T + Tinst-delay         if Tinst-delay > 0
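The RAW path of these rules can be sketched in Python. This is a minimal sketch for an instruction whose sink counter is zero (no WAW conflict); the helper name and the max-of-counters treatment of case 3 are assumptions, chosen to match the worked example given later in this section (the first multiplication is assigned a 6-cycle delay and its sink counter is set to 14):

```python
def raw_delays(c_src1, c_src2, t_e, t_s):
    """RAW (source-dependency) delay assignment when the sink
    counter is zero, i.e. there is no WAW component to add."""
    t = t_e + t_s                                   # total time T = Te + Ts
    # source delay: largest non-zero source counter plus one
    t_src_delay = max(c_src1, c_src2) + 1 if (c_src1 or c_src2) else 0
    t_inst_delay = t_src_delay                      # no WAW component
    c_sink_new = t + t_inst_delay                   # counter for the sink register
    return t_inst_delay, c_sink_new

# mult r3, r2, r1 issued while c1 = 4 and c2 = 5 (Te = 6, Ts = 2)
print(raw_delays(4, 5, 6, 2))   # -> (6, 14)
# the second mult, issued one cycle later with c1 = 3 and c2 = 4
print(raw_delays(3, 4, 6, 2))   # -> (5, 13)
```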

The process of scheduling the instructions is shown in
Fig. 2.16. The result of the scheduling process is
illustrated in Fig. 2.9. The individual RAW and WAW
components are derived and are also illustrated in Fig.
2.16 for each instruction. The algorithms are based on the
counters that monitor the write process to each register.
It is also necessary for the issue unit to recognize the
capacity in which each register is utilized. This
information is stored in an auxiliary unit which is made
available to the decode and the issue units. This unit is
known as the instruction status unit. The instruction status
unit is a two-dimensional array of fields representing the
decoded instruction. The unit contains four major fields.
The fields are encoded in the following manner: 1) the
opcode field contains the opcode of the present instruction,
2) the execution time field represents the execution time
Te, 3) the R field denotes the utilization of the registers
by the instruction. These registers are the general purpose
system registers that are utilized by the functional units.
They can be used as a source or as a destination register

[Fig. 2.16: The counter values while scheduling the instructions — for
each pipeline cycle (3 through 12) the figure shows the issued
instruction (load R1, (X); load R2, (Y); mult R3, R1, R2; store (Z), R3;
add R3, R1, R2; store (U), R3; load R4, (B); load R5, (D); add R3, R4, R5;
store (V), R3) together with its initial counter values C1-C5, RAW hazard
delay, WAW hazard delay, instruction delay, and updated counter values.]
by the instruction, and 4) the C field represents the time
when the registers will be initialized to the new value by
the instructions using the registers as sink registers. The
R and C fields are further divided into subfields. The
number of subfields in the R field is equal to the number
of subfields in the C field. The R fields are set by the
decoding unit. Every subfield in the C field is a counter.
Each counter is associated with a single register. The
counter subfield c1 represents the time that the register r1
will be initialized to a new value by the most recent
instruction. The subfield c2 represents r2 and so on.
Similarly, every subfield of the R field represents a single
register. The subfields of R are set by the decode unit. The
subfields are set to 1 if the register is used as a source
register, set to 0 if the register is used as a sink
register, and set to 3 if neither is true. The value three
represents don't care. For example, the R1 subfield is set
to 1 by the decode unit if register R1 is used as the
source register by the instruction. The counter fields are
updated by the issue unit. The unit is shown in Fig. 2.17.
The change to Fig. 2.2 is illustrated in Fig. 2.18.
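An instruction status unit entry can be sketched as a small data structure. This is an illustrative sketch: the field names follow the description above, and the five-register size is taken from the examples (a real unit would cover all system registers):

```python
from dataclasses import dataclass, field

SRC, SINK, DONT_CARE = 1, 0, 3   # encodings of the R subfields

@dataclass
class StatusEntry:
    opcode: str                   # opcode of the present instruction
    exec_time: int                # execution time Te
    r: list = field(default_factory=lambda: [DONT_CARE] * 5)  # R1..R5 usage
    c: list = field(default_factory=lambda: [0] * 5)          # counters C1..C5

    def tick(self):
        """Decrement every non-zero counter by one each pipeline cycle."""
        self.c = [max(0, v - 1) for v in self.c]

entry = StatusEntry("mult", 6)
entry.r[0] = SRC; entry.r[1] = SRC; entry.r[2] = SINK   # mult R3, R1, R2
entry.c[2] = 14                                          # sink counter set at issue
entry.tick()
print(entry.c[2])   # -> 13
```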
The issue unit schedules the execution of an
instruction in each pipeline cycle. The execution of the
instruction may be delayed. The delayed instruction must be
stored until it is ready to execute. Two schemes are
possible: 1) hold the instruction in the issue unit and

[Fig. 2.17: Instruction status unit — an entry holds the Opcode (opcode
of the instruction), Exec Time (time required to execute the
instruction), the R field with one subfield per system register (R1-R5),
and the C field with the counters (C1-C5) that keep track of the
registers.]

[Fig. 2.18: The modified pipeline system with the instruction status
unit — the fetch unit feeds the decode unit, the instruction status unit
serves the decode and issue units, and the issue unit feeds the floating
point unit, logic unit, and fixed point arithmetic unit.]
freeze the total PIU until the dependency is resolved and
2) issue the instruction to a buffer provided at the
entrance to the execution units. The former scheme will
reduce the efficiency of the pipeline system. There could
be instructions downstream that can be executed and are not
dependent on the instructions in execution. In our example,
instruction 7 is not dependent on any of the previous
instructions. If the PIU is frozen, instruction 7 will
remain in the fetch unit until the PIU is operational again.
A FIFO queue can be introduced between the units of the PIU
to hold the instructions and keep the fetch unit
functioning. This will create a bottleneck as the issue unit
is still disabled and dynamic scheduling will not be
possible. Thus the effective solution is to adopt the latter
scheme and place buffers at the entrance to the execution
units. The non-executable instructions can remain in these
buffers until they are ready to execute. This will free the
issue unit to issue instructions to the execution unit. The
execution unit will also be able to start execution of
instructions that are issued for immediate execution by the
issue unit. In our example, instructions 5 and 6 can be
placed in the buffers and execution of instruction 7 can
begin during the ninth pipeline cycle. The ideal flow
through the PIU is maintained. The space time diagram for
this scheme is illustrated in Fig. 2.9. The changes in the
structure with relation to Fig. 2.18 are shown in Fig. 2.19.

[Fig. 2.19: The pipeline system with the buffer units — buffer units are
placed between the issue unit and each of the fixed point arithmetic,
logic, and floating point units.]

[Fig. 2.20: Instruction listing to illustrate WAR hazard.]

[Fig. 2.21: Resolving WAR hazard using the counters.]
The WAR hazard arises when the resources are not
distributed to the instructions in the buffer as they become
available. Fig. 2.9 is reproduced in Fig. 2.20 to illustrate
the possibility of the WAR hazard. The WAR hazard will
exist between the instructions mult r3, r1, r2; store (Z),
r3; and add r3, r1, r2. The instructions are highlighted
in a block in Fig. 2.21. The counter values are also shown
along with the instructions. The three instructions are
issued to the buffer. The store instruction must capture the
value of r3 before the add instruction changes the content
of r3. When the store instruction is issued, the counter c3
associated with r3 contains a value of 12. It indicates that
the result of r1 * r2 will be loaded into r3 after 12
pipeline cycles. A pointer is introduced in the buffer
holding the store instruction. This pointer is initialized
to the value of c3 at the time of issue. The pointer counts
down by one in each passing pipeline cycle. The pointer is
independent of c3. At the time that the pointer counts down
to 0, the register r3 will be loaded with the result. This
result can be loaded into the buffer before the instruction
begins execution. Fig. 2.22 illustrates the events. The
buffers in each stage are collectively called a delay
station (DS). Each delay station consists of 10 identical
buffers called delay buffers (DB). Each delay buffer (DB)
holds an instruction until it is ready to execute. Each
delay buffer is further subdivided into nine fields:

[Fig. 2.22: The various events of the scheduling process — each buffered
instruction (load r1, (X); load r2, (Y); mult r3, r1, r2; add r3, r1, r2;
load r4, (B); load r5, (D); mult r3, r4, r5) is shown with the pointer
and counter (c1-c5) that track it across the pipeline cycles.]

[Fig. 2.23: Structure of delay buffers — units 1 through 7 are delay
buffers, each holding Pr# (priority number attached to each unit),
ASR1/ASR2 (addresses of source registers 1 and 2), DSR1/DSR2 (delays of
source registers 1 and 2), SD1/SD2 (source data 1 and 2), ID
(instruction delay), and DR (destination register).]
1) priority number (Pr#), 2) address of source register1 (ASR1),
3) delay of source register1 (DSR1), 4) source data1 (SD1),
5) address of source register2 (ASR2), 6) delay of source
register2 (DSR2), 7) source data2 (SD2), 8) instruction delay
(ID), and 9) destination register (DR). The structure of the
delay buffers is illustrated in Fig. 2.23. The DSR1 field
indicates the number of pipeline cycles (from the present
cycle) required by the source register1 to initialize itself
to the correct value. The same concept applies to the DSR2
field. The ID field indicates the time that the instruction
is allowed to start the process of execution in the
arithmetic or logic unit. The delay fields essentially
decrement by one step in each pipeline cycle. They do not
count down below zero. The delay fields in the buffers are
the pointers that keep track of the source registers. When
the source operand is not available at the time of issue,
the counter value associated with the source register is
loaded into one of the pointers in the buffer. The address
of the source is also loaded into the address fields in the
buffer. If the value of the counter is loaded into DSR1,
then the address of the source register must be loaded into
ASR1. Regularity is maintained. When any of the delay fields
associated with the sources reach zero, the address of that
source is released from the source address field and the
data is latched in the associated source data field. The
data is read from the common data bus that links each

[Fig. 2.24: Connectionist model of delay buffers — registers 1 through N
feed, through a splitter and a 5-to-1 multiplexer, the source data
fields of a delay buffer (Pr#, ASR1/DSR1/Data source 1, ASR2/DSR2/Data
source 2, ID, DR).]

[Fig. 2.25: The pipeline system shown along with the register array —
the register array is linked by the common data bus (thick lines) to the
buffer units in front of the fixed point arithmetic, floating point, and
logic units.]
register to the source data fields in the buffers through
a multiplexer. This multiplexer chooses the data path in
accordance with the source address present in the identification
field. The connectionist model is illustrated in Fig. 2.24.
The changes to the structure are shown in Fig. 2.25.
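A delay buffer's countdown-and-capture behaviour can be sketched as follows. This is an illustrative sketch: the class shape and the register-file dictionary standing in for the common data bus are assumptions, while the field names (ASR1, DSR1, SD1, ID, DR) come from the description above:

```python
class DelayBuffer:
    """One DB entry: counts its delays down each pipeline cycle and
    captures operands from the common data bus as the delays expire."""
    def __init__(self, asr1, dsr1, asr2, dsr2, inst_delay, dest):
        self.asr1, self.dsr1, self.sd1 = asr1, dsr1, None
        self.asr2, self.dsr2, self.sd2 = asr2, dsr2, None
        self.id, self.dr = inst_delay, dest

    def tick(self, registers):
        """Advance one pipeline cycle; `registers` models the data bus."""
        if self.dsr1 > 0:
            self.dsr1 -= 1
            if self.dsr1 == 0:            # delay expired: latch the operand
                self.sd1 = registers[self.asr1]
        if self.dsr2 > 0:
            self.dsr2 -= 1
            if self.dsr2 == 0:
                self.sd2 = registers[self.asr2]
        if self.id > 0:                   # delay fields never go below zero
            self.id -= 1
        return self.id == 0               # ready to execute?

db = DelayBuffer("r1", 1, "r2", 2, 3, "r3")
regs = {"r1": 20, "r2": 30}
for _ in range(3):
    ready = db.tick(regs)
print(db.sd1, db.sd2, ready)   # -> 20 30 True
```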
RESOLVING OPERATIONAL HAZARDS:

This collision hazard occurs when the assigned delays
of two different instructions in the same DS are nullified
in the same pipeline cycle. This hazard also occurs when an
instruction cannot be executed because of latency not being
available. It can be resolved by introducing extra time
delays to all instructions that are in the DS. The
scheduling algorithm in the issue unit assigns time slots
for the execution of each instruction. The time slot
assigned to each instruction in the DS is fixed in time,
with respect to the other instructions.

In case of a

conflict between two instructions, the instruction with the


highest priority is executed and a fixed amount of delay is
introduced to all the instructions in DS. The delay is added
to the existing delays of the source delay counters and the
instruction delay counters. The source delay counters which
have already counted down to 0 are not updated by this
operation. The counters in the instruction status unit are
also updated with the same amount of delay. This ensures
that the relative positions of the time slots for execution
of instructions are not changed. The captured data in the

[Fig. 2.26: Resolving the collision of instructions in the PEU — a
time-space diagram (stages F, D, I, E) of six instructions; an
operational hazard between instructions 5 and 6 is resolved by an
additional delay introduced by the execution unit before instruction 6
enters execution.]

buffers is not lost and the new instructions are scheduled
with the updated counter values. This principle is
illustrated in Fig. 2.26. In simple terms, the execution of
all instructions in the DS is moved en masse in time without
disturbing the order. If the instruction cannot be issued
due to lack of latency, the delay required is equal to the
number of pipeline cycles to the first available latency.
This re-scheduling is carried out independently of the issue
unit. This principle is best illustrated in the example
given below. Consider the instruction set listed below:
    load  r1, 20;
    load  r2, 30;
    mult  r3, r2, r1;
    mult  r4, r2, r1;
    store r3;
    store r4;

The load instruction will be issued in the third


pipeline cycle followed by the second load instruction in
the fourth pipeline cycle. c1 and c2 are set to 6 at the time
of issue. r1 will contain the value of 20 in the ninth
pipeline cycle and r2 will be loaded with 30 during the
tenth pipeline cycle. The first multiplication instruction
will be issued in the fifth pipeline cycle. During this
cycle, c1 will contain the value of 4 and c2 has the value
of 5. The counter c3 associated with the sink register r3
will be set according to equation 2.16. There is no WAW
hazard as c3 is initially equal to zero. The instruction
delay is computed as given in eqn 2.19 and is 6 pipeline
cycles. Thus c3 will be updated to 14. The result of this
instruction will be in r3 at the nineteenth cycle. The
second multiplication instruction is issued next with a
delay of 5 pipeline cycles. The value of c4 is set similarly
to the first instruction and is equal to 13. The events and
the counter values are illustrated in Fig. 2.27. The
counters c1 and c2 are decremented as the event of updating
the registers draws nearer. The first store instruction is
issued during the seventh pipeline cycle. The delay is
computed depending on c3 and is equal to 13 cycles. The
last instruction is issued in the eighth cycle with an
assigned delay of 12 cycles. The state of the instructions
in the pipeline during the cycles 7 and 8 are illustrated
in Fig. 2.28. At the eleventh pipeline cycle, both the
multiplication instructions are ready to be executed. Two
generic instructions cannot be executed from the same stage
at the same time. The first instruction has a higher
priority and was loaded into the DS one cycle ahead of the
second multiplication instruction. As a result, the
execution of the second instruction has to be delayed by one
cycle. This implies that all the instructions that are
dependent on the second instruction will also have to be
delayed by one cycle. This has a recursive effect on the
instructions downstream. Since the issue unit fixes the time
slot for execution, the relative placement of the time slots
between the second multiplication instruction and the

[Fig. 2.27: Sequence of events and the counter values — initial and
updated counter values at pipeline cycles 3 through 6 for load R1, 20;
load R2, 30; mult R3, R2, R1; and mult R4, R2, R1.]

[Fig. 2.28: Sequence of events and the counter values — initial and
updated counter values during pipeline cycles 7 and 8 for the two store
instructions.]

[Fig. 2.29: Updating of the delays due to collision hazard — given the
counter values C1-C5 before updating, and assuming a delay of 'k'
pipeline cycles is needed to resolve the hazard, the counters and the
delay buffers are updated by adding the offset 'k' to all non-zero delay
fields; a unit whose DSR1 is already 0 keeps DSR1 = 0 while, for
example, DSR2 = 3 becomes 3 + k and ID = 7 becomes 7 + k.]

downstream instructions must not be changed. Hence all the
delays are incremented by one. The non-zero source delays
are also incremented in the delay buffers. The value of r3
will remain in the register for one extra cycle more
than the originally scheduled time. The process is illustrated
in Fig. 2.29. In general, the instruction Ij will influence
Ij+1 rather than Ij-1. Hence, this displacement does not affect
the previous instructions. It is evident from our example
that the first two instructions are not inconvenienced by this
displacement. Graphically it is illustrated in Fig. 2.30.
The logic instructions are also treated in the same manner.
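The en-masse shift can be sketched as a single pass over a delay station. This is a minimal sketch; treating the station as a list of field dictionaries and passing the status-unit counters as a list are illustrative assumptions:

```python
def resolve_collision(delay_station, counters, k):
    """Add a k-cycle offset to every pending delay so the relative
    time slots of the buffered instructions are preserved."""
    for db in delay_station:
        for f in ("DSR1", "DSR2", "ID"):
            if db[f] > 0:          # delays already at zero are not updated
                db[f] += k
    # the counters in the instruction status unit get the same offset
    return [c + k if c > 0 else 0 for c in counters]

station = [{"DSR1": 0, "DSR2": 3, "ID": 7}]
print(resolve_collision(station, [0, 5], 1))   # -> [0, 6]
print(station[0])   # -> {'DSR1': 0, 'DSR2': 4, 'ID': 8}
```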
2.3  REDUCING BRANCH PENALTY:

A typical instruction set of any computer consists of
two types of branch instructions: conditional
branch instructions and unconditional branch instructions.
The unconditional branch instruction will initiate a jump
in the current flow. The conditional jump instruction will
initiate a jump only if the evaluated element satisfies the
condition. For example, let a branch
instruction specify a branch to location # 60 only if the
register R5 is equal to zero. The branch will take place
only if the condition is positive, i.e., the register R5 is
equal to zero. The sample instruction set listed above is
now modified to include a conditional branch instruction and
is listed below:
1.  load    r1, (X);      r1 <-- (X)
2.  load    r2, (Y);      r2 <-- (Y)

[Fig. 2.30: The process of capturing the operands and resolving
collisions — for pipeline cycles 6 through 13, the counter values and
the contents of delay buffer units 1 and 2 (Pr#, ASR1/DSR1/SD1,
ASR2/DSR2/SD2, ID, DR) are shown before and after updating; the source
data values 20 and 30 are captured into the buffers as the source delays
expire.]

3.  add     r3, r1, r2;   r3 <-- r1 + r2
4.  store   (Z), r3;      (Z) <-- r3
5.  branchz r3, 100;      branch to 100 if r3 = 0
6.  load    r4, (A);      r4 <-- (A)
7.  load    r5, (B);      r5 <-- (B)
8.  mult    r3, r4, r5;   r3 <-- r4 * r5
9.  store   (C), r3;      (C) <-- r3

Instruction 6 will be executed depending on the
outcome of instruction 5. Instruction 5 will be fetched by
the fetch unit at the beginning of the fifth pipeline cycle.
It will reach the execution unit at the beginning of the eighth
pipeline cycle. It is necessary to stop further issue of new
instructions until the branch instruction is evaluated. The
PIU is frozen until the validity of the branch instruction
is determined. If the branch is positive, then the instruction
at location # 100 will be the next instruction to be issued.
On the other hand, if the result is negative, no branch is
initiated and the next instruction to be issued is
instruction 6. The time from the sixth cycle to the cycle
that the branch instruction is evaluated is wasted and,
furthermore, a few cycles are lost in reconfiguring the
fetch unit. This time can be used to pre-fetch the
instruction from the destination address along with
instruction 6. An additional stream is needed to handle the
second fetch. Hence the fetch unit is extended to feed two
instruction streams. A unit to classify the instruction and
generate the effective address is necessary for the second
stream to become operational. The branch instruction will
not be evaluated until the operand is current. During this

time the activity in the decode and the issue units is


suspended, but the fetch unit can prefetch two instructions
and feed two FIFO queues. These queues can hold the
instructions of the current flow and the instructions
starting from the destination address. The queues would be


best placed in the decode unit. Two program counters are
used to fetch instructions to both the streams. A path
controller is necessary to direct the instruction flow to
the two queues. The system is modified as shown in Fig.
2.31. The current stream is known as the present instruction

counter stream (PIC) and the secondary stream is termed as


the effective address counter stream (EAC). To maintain
symmetry, the system consists of two issue and two decode
units, one for each stream. The EAC stream will become the
current stream when the outcome of the branch instruction
is positive. The PIC queue is flushed up to the issue unit.
The instruction flow is resumed from the EAC queue. The
first instruction is fetched from memory by initializing the
program counter of the PIC stream with the starting address.
Subsequent instructions are fetched in each pipeline cycle
by incrementing the program counter. The current instruction
is examined by the instruction classifier to classify the
type of the instruction. If the instruction belongs to the
class of unconditional branch instructions, the program
counter is updated with the new address and the next
instruction will be fetched from the new location. If the

[Fig 2.31: The complete pipeline system shown with two streams — the
fetch unit feeds the EAC queue and the PIC queue ahead of the decode
unit; the instruction status unit, issue unit, buffer units, and
register array connect (thick lines represent the common data bus) to
the fixed point arithmetic, floating point, and logic units.]
instruction belongs to the class of conditional jump
instructions, the destination address is stored in the
program counter of the EAC stream. The EAC stream is
non-functional until the PIC stream encounters the first
conditional jump instruction. In the ensuing cycles, the EAC
stream fetches instructions starting from the destination
address computed from the jump instruction. The pre-fetched
instructions are stored in the EAC queue which is present
in the decode unit. The outcome of the jump instruction
will determine the condition of the streams. If the jump is
negative, the EAC queue is flushed and the PIC stream
remains the current stream. If the jump is positive, the EAC
stream becomes the current stream and the PIC stream is
flushed. There is no delay because the next instruction is
available in the EAC queue. The EAC stream remains current
as long as no branch instructions are encountered. If a
branch instruction is encountered, the PIC queue will start
filling up with the instructions from the address provided
in the branch instruction. The aforementioned scheme will
operate with a single program counter for each stream when
no multiple jump instructions are encountered in the streams
before the current branch instruction is evaluated. In the
general case there could be multiple jump instructions
encountered by the fetch unit in both the streams while
forwarding the instructions to their respective queues. Even
though the decode unit and the issue unit are disabled by

Fig. 2.32 Sample instructions in memory. PIC stream: instructions starting from address 10 in memory (addresses 10-16), containing Jump (Carry = 0) 23, Jump (Overflow = 0) 33, Jump (Carry = 0) 56, Jump (Overflow = 0) 70, and Jump (Result = 0) 80. EAC stream: instructions starting from address 23 in memory (addresses 23-28), containing Jump (Carry = 0) 28, Jump (Overflow = 0) 36, Jump (Carry = 0) 45, and Jump (Result = 0) 60.


a branch instruction, the fetch unit will remain active
until the queues in the decode unit are filled. Consider the
instructions in memory as listed in Fig. 2.32. Let n1 be the
jump instruction encountered by the PIC stream. The program
counter of the EAC stream is initialized with the
destination address. Two instructions are fetched from the
next cycle onward, one for the PIC stream and the other for
the EAC stream. Branch instructions m1 and n2 are
encountered simultaneously by the streams. The first branch
instruction n1 is currently in the issue unit being
scheduled. Instructions cannot be pre-fetched from the
destination addresses of either m1 or n2. A total of four
streams would be required to prefetch the new set of
instructions. It is not possible to flush any of the streams
as the jump instruction n1 has not been evaluated.

The jump instruction cannot be forwarded to the decode unit
as the decode unit does not have the ability to generate an
effective address. Assuming n jump instructions in the PIC
stream and m jump instructions in the EAC stream have been
identified by the fetch unit before it is disabled, a tree
can be formed to illustrate the possible logical paths. For
example, let m = 4 and n = 5. The tree is formed in Fig.
2.33. The parent node is the current jump instruction that
is being processed. The paths to the left indicate the jump
is valid and the paths to the right indicate that the jump
is invalid. The child nodes are the branch instructions
Fig. 2.33 Graphical representation of the data path due to branch instructions (PC = program counter). The jump instruction n1 is being evaluated in the logic unit; the issue unit and the decode unit have suspended operations until the jump is evaluated.


belonging to both the streams. Starting from the parent
node, the branches to the right or left are deleted as each
node in the path is evaluated. If the jump is valid then the
branches to the right are eliminated along with the child
nodes connected by the branch. Assuming that the branch is
taken by the parent node, the node m1 becomes the parent
node for the first branch in the new path. It is evident
that the next branch instruction that has to be evaluated
is directly dependent on the present branch instruction. It
is not possible to accurately predict the outcome of a
branch instruction until it is evaluated. So when more
branch instructions are encountered in both the streams, the
number of possible program paths is 2^n, where n is the
number of branch instructions. With a single program counter
for each stream, pre-fetch cannot be carried out until one
of the streams is flushed. The destination address would be
lost if the jump instruction were forwarded to the decode
stage unprocessed. The instructions along the same stream
can be accessed by default without changing the stream. The
opposite stream will become the program path when the jump
instruction being evaluated initiates a jump. Hence the
destination addresses of the jump instructions that await
evaluation in the PIC queue must be associated with the EAC
stream. Additional counters which are associated with the
EAC stream record the destination addresses before they are
forwarded to the decode unit. When the jump is

taken, the EAC queue becomes the current queue and the
destination addresses of the branch instructions in the EAC
queue must be associated with the PIC queue. Hence the
destination addresses are held in the additional counters
associated with the PIC stream. This scheme aids the
pipeline system in reducing the branch penalties. In our
example, the destination address of n2 is held in counter 1
of the present EAC stream and the destination address of m2
is held in counter 1 of the present PIC stream. If n1 is
positive then the EAC stream becomes the current stream. The
PIC stream is flushed and pre-fetching can be started in the
next cycle as the address is available in counter 1.
Similarly, if the branch instruction n1 is negative, the
present PIC stream remains the active stream. The EAC stream
and queue are flushed and prefetching starts in the next
pipeline cycle by using the address of n2 in counter 1.
Figures 2.34 to 2.36 illustrate the sequence of operations
assuming that the branch is taken and m1 is the new parent
node. The instructions starting from address #28 are fetched
by the PIC stream. A flow chart depicting the events is
shown in Fig. 2.38. The new fetch unit is illustrated in
Fig. 2.37.
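The stream switching described above can be summarized in a small model (an illustrative Python sketch written for this discussion; the names `Stream`, `on_conditional_jump`, and `resolve_branch` are invented here and are not part of the thesis hardware):

```python
from collections import deque

class Stream:
    """One instruction stream: a program counter plus a FIFO queue."""
    def __init__(self, name):
        self.name = name
        self.pc = None          # program counter; None while the stream is idle
        self.queue = deque()    # FIFO queue held in the decode unit

def on_conditional_jump(other, dest):
    # The destination address of the first pending jump primes the other
    # stream's program counter so that prefetching can begin at once.
    if other.pc is None:
        other.pc = dest

def resolve_branch(taken, pic, eac):
    """Model the flush that follows evaluation of the jump in the logic unit."""
    if taken:
        pic.queue.clear(); pic.pc = None   # PIC queue flushed up to the issue unit
        return eac, pic                    # EAC becomes the current stream
    eac.queue.clear(); eac.pc = None       # jump negative: EAC queue flushed
    return pic, eac

pic, eac = Stream("PIC"), Stream("EAC")
pic.pc = 10                            # first instruction fetched from address 10
on_conditional_jump(eac, dest=23)      # Jump (Carry = 0) 23 encountered
current, other = resolve_branch(taken=True, pic=pic, eac=eac)
```

Running the example with the jump of Fig. 2.32 taken leaves the EAC stream current with its program counter at the destination address 23, while the PIC stream is flushed.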
2.4 HARDWARE SYSTEM:

The pipeline system is designed at the system level
with the individual units of the PIU and the PEU. The
complete system is shown in Fig. 2.39. The individual units

Fig. 2.34 Sample instructions and the possible data paths (PIC stream: instructions starting from address 10 in memory; EAC stream: instructions starting from address 23 in memory, as in Fig. 2.32).

Fig. 2.35 The contents of the counters after fetching the last instruction in both the streams, assuming the jump is being evaluated in the logic unit (EAC and PIC stream counters).


Fig. 2.36 Sequence of updating the counters during the jump operation (PC = program counter). The jump instruction n1 has been evaluated and the branch is taken; the old PIC stream is the redundant stream and hence it is flushed.

Fig. 2.37 The fetch unit (instructions from memory; counter set 1 (PIC/EAC counters); path selector and controller; control paths to disable individual streams; instruction paths to the EAC and PIC queues).


Fig. 2.38 Flow chart for the PIC queue assuming the PIC queue is in session (boxes: fetch instruction; place in PIC queue; PC(PIC) <-- PC + 1; if the EAC stream is in session, load the EA into the first available empty counter of the EAC stream, otherwise load the EA into the program counter of the EAC stream and start the EAC stream).

Fig. 2.38 The flow chart of the PIC queue assuming the PIC queue is in session (cont'd) (boxes: clear the program counter and the associated counters 1-n of the EAC field; load the program counter with the contents of counter 1; move the contents of the counters one counter to the left).

Fig. 2.38 Flow chart for the EAC queue assuming the EAC queue is in session (cont'd) (boxes: fetch instruction; place in EAC queue; PC(EAC) <-- PC + 1; if the PIC stream is in session, load the EA into the first available empty counter of the PIC stream, otherwise load the EA into the program counter of the PIC stream and start the PIC stream).

Fig. 2.38 The flow chart of the EAC queue assuming the EAC queue is in session (cont'd) (boxes: are the counters full; clear the program counter and the associated counters 1-n of the PIC field; load the program counter with the contents of counter 1; move the contents of the counters one counter to the left).

Fig. 2.39 The proposed look-ahead pipeline computer system (with the main memory module).


are provided with local controllers which are responsible
for the functioning of each unit. The local controllers can
communicate with each other. The system contains five
general purpose registers: R1, R2, R3, R4, and R5. Data to
and from these registers are transferred over the common
data bus. Each register is associated with a program status
register that represents the condition of the value in the
register. The instructions enter the pipeline system through
the fetch unit. The address of the instruction to be fetched
is issued by a counter, referred to as the program counter,
present in the fetch unit. The opcode of the newly fetched
instruction is checked to determine whether the instruction
is a branch instruction. A non-branch instruction is
passed unchanged to the next stage. A branch instruction
is further classified as a conditional or an unconditional
branch instruction. For an unconditional branch instruction,
the program counter is updated with the destination address
from where the instructions are fetched in the pipeline
cycles that follow. The handling of conditional branch
instructions is explained in section 2.4.1. The fetch unit
can fetch two instructions simultaneously to reduce the
branch overheads. The individual data paths of the fetch
unit are termed instruction streams. The current
operational instruction stream is determined by the logic
unit. The switching of streams is carried out by a path
controller in the fetch unit. The control information from


the logic unit is fed to this unit, which in turn determines
the current stream. The instruction is forwarded to the
decode stage. The decode unit consists of a local FIFO queue
and an instruction decoder for each of the two streams of
the fetch unit. The instruction first enters the FIFO queue
and then reaches the decoder. The current operational stream
is determined by the logic unit and is the same as in the
fetch unit. The individual streams of the fetch unit are
disabled if the corresponding queues in the decode unit are
filled. The decode unit splits the instruction into its
fundamental components, namely the source operands, the
destination operands, and the operation involved. This
information is recorded in the instruction status unit,
which is a part of the decode unit and is common to both the
streams. The function of this unit is to supply information
to other units about the specification of the present
instruction. The instruction status unit is a two
dimensional array that records the past and the current
history of instructions executed in the pipeline system. The
decode unit is controlled by the logic unit and is disabled
when a branch instruction is being evaluated in the
execution unit. The unit is explained in section 2.4.2. The
decoded instruction is forwarded to the issue unit after all
the relevant information about the instruction is recorded
in the instruction status unit. The issue unit checks the
instruction status unit to determine dependencies between
the current instruction and the


instructions that have been issued to the execution unit.
The issue unit schedules the execution time of the
instruction to resolve the hazards. The delay time is
calculated from the information provided by the instruction
status unit. The instruction is set into a certain format
and sent to the execution units. The issue unit is described
in section 2.4.3. The delayed instructions are held in
buffers until the hazards are resolved. The delay stations
monitor the registers to capture the missing operands as
they become available. Data transfer to and from the
registers is carried over the common data bus. The
arithmetic instructions are executed by the arithmetic units
and the logic instructions are executed by the logic unit.
These units are also provided with controllers that monitor
the units to resolve structural hazards. The instructions
are initiated into their units when the appointed time slot
has arrived. The branch instruction is held in the logic
unit until it is resolved. During this time the issue unit,
along with the decode unit, is disabled. The fetch unit is
not dependent on the execution unit but is dependent on the
condition of the queues in the decode unit. The controllers
of the various units communicate with each other via the
common system control register, which has fields associated
with each unit. These fields are write-only for the
designated unit and read-only for the remaining units. The
total system is illustrated in Fig. 2.40. The individual

Fig. 2.40 The overview of the complete system (common data bus; main memory unit; fetch unit with counter sets 1 and 2, the opcode classifier and EA generator, and the path selector and controller; decode units with the PIC and EAC queues, decoder units 1 and 2, and the instruction status unit; issue units 1 and 2; the logic unit controller; registers R1 to R5; execution units).


units are described in the following sections.


2.4.1 FETCH UNIT:

The fetch unit comprises the logic to fetch
instructions from memory, two sets of counters, an opcode
classifier, an effective address (EA) generator, and the
path controller. The path controller is also the local
controller for the fetch unit. The function of the opcode
classifier is to determine the type of the present
instruction. The fetch unit is capable of fetching two
instructions simultaneously. This is done to reduce the
branch overheads. Each set of counters consists of 10
individual counters, identified as counter 0 to counter 9.
The counters referred to as counter 0 are used as program
counters in the individual sets. Each set supports a single
instruction stream. The PIC stream starts as the current
instruction stream. The instruction streams end in two FIFO
queues in the next stage. The unit is illustrated in Fig.
2.41. The counters of each stream are initialized by the
instructions passing through the opposite stream. The
counters 1 to 9 are filled with the destination addresses
held by the branch instructions that are awaiting execution
in the FIFO queues. The fetch unit is disabled if the queues
are filled with instructions. Individual streams are
disabled once the associated queue is full. Branch
instructions are held at the issue unit of the current
stream until the outcome is finalized. The decode unit and

Fig. 2.41 The fetch unit (instructions from memory; counter set 2 (EAC/PIC counters); path selector and controller; A: control signals for path information; C: control signals to disable individual streams).


the issue unit are disabled, but the queue is still filled
with instructions. The instructions of both the streams are
classified by the classifier and the various conditional
branch instructions are identified. When a branch
instruction is encountered in the PIC stream, the
destination address is calculated and placed in a counter
belonging to the EAC stream. The appropriate counter is
determined by the number of branch instructions that are
present between the present instruction in the fetch unit
and the branch instruction that is being currently evaluated
in the logic unit. Thus the counter 1 of the EAC stream is
initialized with the address of the first branch instruction
in the PIC queue, with respect to the branch instruction
that is in the logic unit. At any instant of time there will
only be a single branch instruction being evaluated in the
logic unit. Counter 2 (EAC stream) is loaded with the
destination address of the second branch instruction in the
PIC stream, and so on. In general, the destination addresses
of the branch instructions are loaded into the counters of
the EAC stream in the same order as their physical presence
in the PIC stream. This allows the EAC stream to store all
the possible destination addresses. In the event of the
current branch instruction not being valid, the EAC queue
is flushed and the counter 0 is loaded with the value in
counter 1, along with the other addresses moving up one
counter to the left. This is shown in Fig. 2.42. The

Fig. 2.42 The counters associated with each queue (instructions from memory; path selector and controller; control paths; A: control signals for path information; C: control signals to disable individual streams).


same procedure is followed by the EAC stream by loading the
counters 1 to 9 in the PIC stream. If the branch is taken,
the EAC stream becomes the current stream and the PIC queue
is flushed along with the contents of the counters 1 to 9
of the EAC stream. The counter 0 of the PIC stream is loaded
with the address stored in counter 1, and the PIC stream
starts fetching instructions from the new address. The
other addresses are also moved up by one counter to the
left. The address present in counter 0 corresponds to the
branch instruction in the EAC stream that is being currently
evaluated, or the first jump instruction that will be
evaluated by the EAC stream. The path selector holds the
identity information and is responsible for the loading of
addresses into the counters. The path selector monitors the
decode queue and disables the fetch unit or the individual
streams as necessary. The external control signals that are
needed by the fetch unit are: change path, disable EAC
queue, and disable PIC queue.
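The counter discipline of Fig. 2.42, loading destination addresses into the first empty counter and shifting the whole set one counter to the left when a branch resolves, can be sketched as follows (illustrative Python, not the hardware; the list `counters` stands for counter 0 through counter 9 of one stream, with counter 0 acting as the program counter):

```python
def record_destination(counters, dest):
    """Load a pending jump's destination into the first empty counter (1..9)."""
    for i in range(1, len(counters)):
        if counters[i] is None:
            counters[i] = dest
            return i
    raise RuntimeError("all counters full; the stream must stall")

def branch_resolved(counters):
    """Counter 0 takes the value of counter 1 and the rest move one to the left."""
    counters[0] = counters[1]
    for i in range(1, len(counters) - 1):
        counters[i] = counters[i + 1]
    counters[-1] = None

counters = [None] * 10            # counter 0 is the program counter of the stream
record_destination(counters, 28)  # destination of the first pending jump
record_destination(counters, 36)  # destination of the second pending jump
branch_resolved(counters)         # prefetch restarts from address 28
```

After the shift, counter 0 holds the address of the jump that will be evaluated next, exactly as the text requires.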


2.4.2 DECODE UNIT:

The decode unit decodes the instruction and identifies
the sink and the source operands. The decode unit consists
of two instruction queues and two decoder units. Instruction
queue 1 is designated as the PIC queue and instruction queue
2 is designated as the EAC queue. The queues are FIFO in
nature. Two different instructions belonging to two
different streams can be simultaneously decoded by their
Op-code  Exec Time  R1  R2  R3  R4  R5  C1  C2  C3  C4  C5
add      6          0   1   1   3   3

Op-code: the opcode of the instruction.
Time: the time required to execute the instruction.
R-field: the fields representing the registers in the system.
C-field: counter fields associated with the registers.
0: register used as the sink (destination) register.
1: register used as the source register.
3: not used in the instruction under consideration.

Fig. 2.43 Instruction status unit.


individual decoders. The decoded information is stored in
the instruction status unit. When the instruction status
unit is filled up, the current decoded information is stored
at the beginning of the array. In this manner the old
records are overwritten with the new ones. The roll over is
necessary so as to limit the size of the unit. Both the
streams use the same unit. For example, let the instruction
read from the PIC queue be R1 = R2 + R3. In machine language
mnemonics it would be stated as ADD R1,R2,R3. R1 is the
destination or sink register. R2 and R3 are the source
registers. The operation specified is ADD. The decoded
instruction in the instruction status unit would read as in
Fig. 2.43. The digit 0 represents that the associated
register is used as a sink register. The digit 1 indicates
the source registers. The digit 3 is used as a null
variable. The decode unit is controlled by a queue
controller that monitors and controls the FIFO queues. The
controller is assigned the duty of determining whether a
particular queue is full or in operation, and of flushing
the redundant queue. The control signals flush queue one and
flush queue two are needed by the controller to flush the
queues. The controller puts out a control signal which
indicates the queues that are full. The unit is illustrated
in Fig. 2.44.
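The record format of Fig. 2.43 can be sketched as below (illustrative Python written for this discussion; the encoding 0 = sink, 1 = source, 3 = unused follows the figure, and the execution time of 6 for ADD is taken from the figure's sample row):

```python
SINK, SOURCE, UNUSED = 0, 1, 3

def decode_to_status(opcode, sink, sources, exec_time,
                     registers=("R1", "R2", "R3", "R4", "R5")):
    """Build one row of the instruction status unit for a decoded instruction."""
    row = {"opcode": opcode, "time": exec_time}
    for r in registers:
        row[r] = SINK if r == sink else SOURCE if r in sources else UNUSED
    for r in registers:
        # C fields: counters associated with each register, later set by the
        # issue unit; zero means no outstanding delay on that register.
        row["C" + r[1:]] = 0
    return row

row = decode_to_status("ADD", sink="R1", sources=("R2", "R3"), exec_time=6)
```

The row for ADD R1,R2,R3 then reads 0 for R1, 1 for R2 and R3, and 3 for the unused registers, matching Fig. 2.43.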
2.4.3 ISSUE UNIT:

The issue unit issues the instruction to the execution

Fig. 2.44 Decode unit (flush queue 1 and flush queue 2 control signals; instructions from the fetch unit; decoded instructions to issue units 1 and 2; thick lines represent data lines, thin lines are the control lines).

Fig. 2.45 Instruction format issued to the execution unit (fields: Opcode, ASR1, DSR1, SD1, ASR2, DSR2, SD2, DR).


units. The function of the issue unit is to schedule the
execution of the instructions. Each stream contains its own
issue unit. At any instant of time, the operating issue unit
belongs to the stream that is current. The issue unit
controls the C field in the instruction status unit, and the
delay is set according to the information that is available.
The issue unit consists of: 1) logic capable of resolving
the RAW and WAW hazards, and 2) the instruction router unit.
The issued instruction is formatted as shown in Fig. 2.45
and is forwarded to the execution unit. The output of the
issue unit is made up of eight fields: 1) address of source
register one (ASR1), 2) operand of source register one
(DSR1), 3) source delay one (SD1), 4) address of source
register two (ASR2), 5) data of source register two (DSR2),
6) source delay two (SD2), 7) instruction delay (ID), and
8) destination register (DR). The instruction is fetched
from the current queue in the decode unit and is
simultaneously fed to the main hazard resolving unit. The
various delays are computed and the instruction is formatted
to be issued to the execution unit in the next pipeline
cycle. If the operands are available, they are loaded into
the operand data fields and then issued. For example, let
ADD R1,R2,R3 be the present instruction encountered by the
issue unit. If C1, C2, C3 are all zeros and R1=R2=R3=5, then
the formatted instruction would read as displayed in Fig.
2.46. On the other hand, if C1, C2, C3 are non-zero, the

R1 = R2 + R3, where R2 and R3 are available. Let R2 = 5 and R3 = 5.

Opcode  ASR1  DSR1  SD1  ASR2  DSR2  SD2  DR
ADD     R2    5     0    R3    5     0    R1

Fig. 2.46 Formatted instruction for 'add R1,R2,R3' with no delay

R1 = R2 + R3, where R2 and R3 are not available. The delay associated with R2 and R3 is 3 and 4 respectively.

Opcode  ASR1  DSR1  SD1  ASR2  DSR2  SD2  DR
ADD     R2    -     3    R3    -     4    R1

Fig. 2.47 Formatted instruction with delay, forwarded to the execution unit

Fig. 2.48 Issue unit (instructions from the EAC and PIC queues; instructions to the execution unit; N: update counter fields in the system status unit; M: input of the counter fields from the system status unit; U: common data bus; V: disable issue unit signal from the logic unit controller; thick lines represent data flow, thin lines are the control lines).


delays have to be computed, and such an instruction would be
forwarded to the execution unit as shown in Fig. 2.47.
Conditional branch instructions are handled in a different
manner. The issue unit calculates the time when the correct
result will be available and forwards it to the logic unit.
The issue unit is then disabled along with the decode unit
until the branch instruction is evaluated. The issue unit is
illustrated in Fig. 2.48.
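The formatting rule behind Figs. 2.46 and 2.47 can be sketched as follows (illustrative Python; the field names follow Fig. 2.45, while the function name and the dictionaries `regs` and `counters` are invented for the example):

```python
def format_instruction(opcode, src1, src2, dest, regs, counters):
    """Build the issued-instruction fields; a non-zero counter means the
    operand is not yet available, so a source delay is attached instead."""
    inst = {"opcode": opcode, "ASR1": src1, "ASR2": src2, "DR": dest,
            "DSR1": None, "DSR2": None, "SD1": 0, "SD2": 0}
    if counters[src1] == 0:
        inst["DSR1"] = regs[src1]       # operand available: load the data field
    else:
        inst["SD1"] = counters[src1]    # otherwise record the source delay
    if counters[src2] == 0:
        inst["DSR2"] = regs[src2]
    else:
        inst["SD2"] = counters[src2]
    return inst

regs = {"R2": 5, "R3": 5}
no_delay = format_instruction("ADD", "R2", "R3", "R1", regs, {"R2": 0, "R3": 0})
delayed  = format_instruction("ADD", "R2", "R3", "R1", regs, {"R2": 3, "R3": 4})
```

Here `no_delay` matches Fig. 2.46, with the operand values loaded and zero source delays, while `delayed` matches Fig. 2.47, with source delays of 3 and 4 attached in place of the missing operands.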
2.4.4 EXECUTION UNIT:

The execution unit comprises three sub-units, namely:

1) dynamic fixed point arithmetic unit, 2) dynamic floating

point arithmetic unit, and 3) logic unit. The fixed point


arithmetic unit is a pipelined unit with seven stages. The
design is based on the carry save adder tree for multiple
additions of binary numbers. The behavior of the arithmetic
unit is dynamic and it c a n execute four different
operations:

add, subtract, multiply and divide without

reconfiguring the pipeline. It can also handle upto three


different arithmetic operations being processed in the
various stages in the same pipeline cycle. The arithmetic
instructions whose operations are multiplication or division
are allowed to enter the arithmetic unit at stage one.
Addition and subtraction instructions are introduced to the
pipeline at stage six. The results are uploaded to
destination registers at stage 7. Arithmetic instructions
issued by the issue unit that do not contain the appropriate

Fig. 2.49 Dynamic pipelined execution unit (instructions from the issue unit; the dynamic fixed point arithmetic unit, dynamic floating point unit, and logic unit, each with its own controller; DS: delay station; thin lines: control signals to control the input and output; thick lines: instruction flow from the issue unit; very thick lines: output data lines to the common data bus).


operands are held at stage 1 or stage 6, depending on the
operation specified by the instruction. The floating point
unit is a repetition of the fixed point unit with some
external combinational circuitry to take care of the
additional processing. The logic unit is responsible for the
execution of logic instructions and the evaluation of branch
instructions. Branch instructions that have to be evaluated
are held at the DS provided in the logic unit. Additional
memory elements are provided in stages one and six of the
arithmetic units and stage three of the logic unit. These
memory elements store the instructions that are issued by
the issue unit until they are ready to be executed. The
execution units are illustrated in Fig. 2.49. Ten buffers
are available at the entrances to the execution unit. Every
DB has equal access to the execution unit. Priority numbers
are assigned to every DB in the reservation station. The
priority number is used to determine the instruction that
has been waiting for the longest time in a delay station. In
case of a conflict between two incompatible instructions
that require the use of the same stage of the execution
unit, the instruction with the highest priority is executed.
The priority numbers are daisy chained as instructions are
executed. Each delay station operates independently of the
others in the system. The individual execution units are
provided with dedicated controllers which provide collision
free execution of instructions in the execution units. The
controllers of


the arithmetic units provide collision free execution of
instructions. The logic unit controller is responsible for
evaluating the pending branch instructions and directing the
instructions into their respective buffers. The controllers
communicate with each other by using a common control
register. The detailed design of the controllers is beyond
the scope of this research.
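The entry-point rule (multiply and divide enter at stage one, add and subtract at stage six) together with the waiting-time priority can be modelled as below. This is an illustrative Python sketch, not the controller design, which the thesis leaves out of scope; in particular, the daisy-chain rule is read here as every waiting instruction moving up one priority step when an instruction is started, which is one plausible interpretation.

```python
ENTRY_STAGE = {"MUL": 1, "DIV": 1, "ADD": 6, "SUB": 6}

def select_for_stage(stage, waiting):
    """From the buffered instructions that enter at this stage, start the
    one with the highest priority number (the longest-waiting entry)."""
    candidates = [w for w in waiting if ENTRY_STAGE[w["op"]] == stage]
    if not candidates:
        return None
    winner = max(candidates, key=lambda w: w["priority"])
    waiting.remove(winner)
    for w in waiting:                  # daisy-chain: the others move up one
        w["priority"] += 1
    return winner

waiting = [{"op": "ADD", "priority": 2},
           {"op": "SUB", "priority": 5},
           {"op": "MUL", "priority": 1}]
first = select_for_stage(6, waiting)   # conflict at stage six: SUB has waited longest
```

In the example the ADD and SUB compete for stage six; the SUB wins because it carries the higher priority, and the remaining entries each advance one step.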
The next chapter deals with the structure and design
of the fixed point arithmetic unit.

CHAPTER THREE
DESIGN OF DYNAMIC PIPELINE ARITHMETIC UNIT
3.1 INTRODUCTION:

The arithmetic functions are executed by the arithmetic
units in the execution unit. The design of these units
determines the throughput of the total system. The
arithmetic units are modelled after the static Wallace tree
structure capable of performing multiple number additions.
The advantage is that the architecture is pipelined, and
modifications to increase the computing capabilities are
possible. The static unifunction pipeline has to be
converted to a multifunction pipeline capable of handling
addition, subtraction, multiplication and division.
Individual functional units can be provided, but this leads
to an increase in, and redundancy of, the hardware. The
design criterion is to model a single arithmetic unit which
is capable of carrying out the execution of different
arithmetic instructions at the same time. The algorithms for
performing the arithmetic operations are chosen so as to
complement the structure of the Wallace tree.
The Wallace tree is first described and the
modifications are carried out depending on each of the four
operations.
3.2 PRINCIPLE OF OPERATION OF THE CSA TREE:

Multiple number addition can be realized with a
multilevel tree adder. The conventional carry propagate
adder (CPA) adds two inputs and yields a single output. A
carry save adder (CSA) receives three inputs and produces
two outputs called the sum vector (SV) and the carry vector
(CV). The CSA is a full adder wherein the carry-in bit is
treated as an element of the third input vector. The
carry-out bit is treated as an output element of the carry
vector and is not forwarded to the next full adder. The sum
vector and the carry vector are treated as individual
vectors and are operated upon in the same manner.

An n-bit CSA element consists of n full adders wherein
the carry-in bits of the individual adders are used to enter
the third vector and all the carry-out terminals act as the
output lines of the carry vector. The carry lines are not
internally interconnected in a carry save adder. The truth
table of a single CSA element is illustrated in Fig. 3.1
along with the CSA element.
Mathematically the carry save adder is represented as:

A + B + D = S + C          (3.1)

where + is the arithmetic addition operation. The input
vectors are A, B, and D. The output vectors are S (the sum
vector) and C (the carry vector). The total sum of the three
input vectors is obtained by adding the S vector to the C
vector. The carry vector is shifted one bit to the left
compared to the sum vector. This shifting of the carry bit
is necessary to maintain the correct placement of the
vectors with respect to each other.

Fig. 3.1 The CSA element and its truth table (carry save adder unit with sum and carry outputs).

Fig. 3.2 CSA operation of adding three elements.

In the process of summation, the carry bit of the lowest
significant bit is added along with the next higher order
bits. This principle is illustrated in Fig. 3.2. If it is
necessary to perform multiple number additions, then the
individual CSA elements are configured into stages of a
pipeline. The process of adding n vectors, m bits long, is
carried out as follows. The input binary vectors are divided
into k groups consisting of three vectors each. If the
number of vectors is not a multiple of three then the value
of k is equal to the highest number of groups that are
possible. The number of CSA elements required to start the
process is equal to k m-bit CSA units. The CSA elements
which perform the operations in parallel are grouped
together into a stage or a level. The ungrouped vectors are
passed undisturbed to the next stage. The aim is to merge
the n input vectors into two vectors S and C, each 2*m bits
long. The process of merging is carried out in stages. The
number of CSA elements in a stage is equal to the highest
number of three-vector groups that are possible from the
input vectors to that stage. The ungrouped vectors are
passed on to be processed in the next stage. The final
result is obtained by adding the last sum and carry vectors.
The relative order of the vectors has to be maintained
throughout the pipe so as to obtain the correct
result.
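The merging process described above can be sketched on Python integers, where equation (3.1) becomes a + b + d == s + c and the carry vector is shifted one bit to the left (an illustrative sketch written for this discussion, not the hardware design; the final addition stands in for the concluding CPA step):

```python
def csa(a, b, d):
    """Carry save adder: three input vectors in, sum and carry vectors out.
    The carry vector is shifted one bit to the left before further use."""
    s = a ^ b ^ d                            # bitwise sum, no carry propagation
    c = ((a & b) | (a & d) | (b & d)) << 1   # carry out of each bit position
    return s, c

def csa_tree(vectors):
    """Merge n vectors into two by repeated 3-to-2 reduction, level by level;
    ungrouped vectors pass undisturbed to the next level."""
    while len(vectors) > 2:
        nxt = []
        for i in range(0, len(vectors) - 2, 3):   # k three-vector groups
            nxt.extend(csa(*vectors[i:i + 3]))
        nxt.extend(vectors[len(vectors) - len(vectors) % 3:])  # leftovers
        vectors = nxt
    return sum(vectors)  # final CPA step: add the last sum and carry vectors
```

Because Python integers behave as arbitrarily long bit vectors, the relative displacement of the carry vector is captured by the single left shift, and the result equals the ordinary sum of the inputs.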
Let the eight vectors shown below be the shifted
multiplicands of two eight bit binary vectors wherein the
operation between them is multiplication. These partial
products are to be added to obtain the final product and
hence involve multiple additions. The leading and trailing
zeros are added to show the relative displacement of the
vectors with respect to each other.
W1    0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1
W2    0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 0
W3    0 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0
W4    0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0
W5    0 0 0 0 1 1 0 1 0 0 1 1 0 0 0 0
W6    0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0
W7    0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0
W8    0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0

W1 to W8 are the partial products of the binary
multiplication of two vectors, and in this example they are
treated as 16 bit binary vectors.
If we restrict the operation of the CSA tree to adding
multiple numbers, a Wallace tree can be structured. In
general, a v-level Wallace tree can add up to N(v) input
numbers, where N(v) is evaluated by the following recursive
formula [23]:

    N(v) = 1.5 * N(v-1) - 0.5 * (N(v-1) mod 2)        (3.2)

with N(1) = 3.
For example, we need 10 CSA tree levels to add 64 to 94
numbers in one pass through the tree.
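The recursion is easy to check numerically. The sketch below (the function names are my own) evaluates N(v) and the smallest tree depth whose capacity covers a given operand count, assuming N(1) = 3.

```python
# Capacity N(v) of a v-level Wallace (CSA) tree, per the recursive formula
# N(v) = 1.5*N(v-1) - 0.5*(N(v-1) mod 2), with N(1) = 3.

def wallace_capacity(v):
    n = 3
    for _ in range(v - 1):
        n += n // 2  # integer form of 1.5*n - 0.5*(n % 2)
    return n

def levels_needed(count):
    """Smallest v with wallace_capacity(v) >= count."""
    v = 1
    while wallace_capacity(v) < count:
        v += 1
    return v

print([wallace_capacity(v) for v in range(1, 11)])
# -> [3, 4, 6, 9, 13, 19, 28, 42, 63, 94]
print(levels_needed(64))  # -> 10, matching the 64-to-94 range in the text
```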

Mathematically, for eight of the eight bit vectors we need a
five level CSA tree. The leading zeros are omitted for the
calculations. The process of addition is illustrated below,
wherein SV represents the sum vector and CV represents the
carry vector. The number of groups of three vectors that can
be formed is two (k=2), and hence two eight bit CSA units
are required to start the process.
Level 1:

The following is the operation in eight bit CSA unit #1:

The following is the operation in eight bit CSA unit #2:

At the end of level one the results are tabulated in their
correct order:

These vectors are forwarded to level 2 for further
processing. At level two there are six binary vectors to be
added. The number of three vector groups that are possible
is two (k=2). They are 1) SV1, CV1, SV2 and 2) CV2, W7, W8.
The operation of CSA units three and four is as follows:
Level 2:

CSA unit number #3:

CSA unit number #4:

The results of level two are tabulated below:

These results are forwarded to level three.

In level three, the three inputs SV3, CV3, and SV4 are
converted into two outputs which are then used to compute
the result. The operation of CSA unit five in level 3 is as
follows:

The results of level 3 are tabulated below:

The above results are forwarded to level 4 along with CV4.

Level 4:

The above vectors are forwarded to level 5 for the final
summation.

Level 5:

SUM

1 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1

The complete structure of the pipelined unit is shown in
Fig. 3.3. Anderson et al. [22] have used this concept and
modified it to suit the needs of the IBM System 360/Model 91.


3.3  CONVERSION OF UNIFUNCTION STRUCTURE TO MULTIFUNCTION STRUCTURE:

The structure discussed is a static unifunction


pipeline. It can carry out only the additions of multiple

Fig. 3.3  CSA structure to add eight, 8-bit binary vectors

numbers which are N bits wide. The main aim of the design
is to modify the static structure to support the operations
of addition, subtraction, multiplication and division.
3.3.1  MODIFICATIONS DUE TO ADDITION AND SUBTRACTION:


The last stage of the CSA tree structure can support
addition and subtraction. The addition can be carried out in
the adder unit which sums up the two final vectors from
stage 5. Hence a path has to be created to load the two
vectors from the external sources. A multiplexer is
introduced to choose between the two streams. The changes
are illustrated in Fig. 3.4. The subtraction is carried out
using the two's complement method, which means the number to
be subtracted has to be inverted on demand and a value of
one added to the least significant binary digit. The
operation of inversion on demand can be achieved by using
XOR gates and controlling one of the inputs. Hence an XOR
gate array is attached to one of the branches of the
external input data stream as shown in Fig. 3.5.
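Functionally, the XOR array plus carry-in behaves as below. This is a behavioral sketch assuming an 8-bit word; the names `add_sub`, `select`, and `carry_in` are mine, not signals from the design.

```python
# Two's complement subtraction via an XOR gate array: when subtracting,
# every bit of the second operand is inverted (XOR with all-ones) and a
# carry-in of 1 supplies the final +1 of the two's complement.

WIDTH = 8
MASK = (1 << WIDTH) - 1

def add_sub(a, b, subtract):
    select = MASK if subtract else 0  # drives one input of each XOR gate
    carry_in = 1 if subtract else 0
    return (a + (b ^ select) + carry_in) & MASK

print(add_sub(100, 37, subtract=True))   # -> 63
print(add_sub(100, 37, subtract=False))  # -> 137
```

The same adder path thus serves both operations, with only the select line and the carry-in differing.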
3.3.2  MULTIPLICATION:

The operation of multiplication that is being attempted is
very similar to that of decimal multiplication. When two
binary vectors
of lengths m and n are to be multiplied, the final product
will be a vector with the maximum length of (m + n). Let a
vector A of length n be multiplied with another vector B of
length m. Each member of the vector B namely bj (for all

Fig. 3.4.  Changes to the pipeline due to addition.

Fig. 3.5.  Changes to the pipeline due to subtraction.

j=0,m) is multiplied with each member of vector A, namely ai
(for all i = 0,n), to produce m such vectors called the
partial product vectors. The process is illustrated in Fig.
3.6 for m equal to 6 and n equal to 6.

P = { P10 P9 P8 P7 P6 P5 P4 P3 P2 P1 P0 } = Total product.

Fig. 3.6.  Multiplication by multiple additions.

The values of m and n were chosen to be 6, and hence the
product vector has eleven elements. This process requires
six shifts and six adds to get the product. The partial
products are necessary to compute the result and they must
be shifted according to the weight of their multiplier bit
bj.
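The shifted multiplicand generation can be sketched as follows; `shifted_multiplicands` is a hypothetical helper name, and integers stand in for the bit vectors.

```python
# Each bit b_j of the multiplier B contributes a partial product equal to
# the multiplicand A shifted j places (or zero when b_j is 0); the product
# is the multiple-number sum of these vectors, as fed to the CSA tree.

def shifted_multiplicands(a, b):
    return [(a << j) if (b >> j) & 1 else 0 for j in range(b.bit_length())]

a, b = 0b101101, 0b110110  # two 6-bit vectors, as in Fig. 3.6
parts = shifted_multiplicands(a, b)
print(sum(parts) == a * b)  # -> True
```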

Fig. 3.7  Changes to the pipeline by multiplication
(shifted multiplicand generator producing W1 to W8)
3.3.3  MODIFICATIONS DUE TO MULTIPLICATION:

A stage is added at the top of the pipe to calculate the
shifted multiplicands of the input binary vectors. This
stage calculates the partial products of the two binary
vectors that are to be multiplied and presents them as
multiple vectors. The new stage is shown in Fig. 3.7.
3.3.4  DIVISION:
The division process is different from the usual shift and
subtract operation. The principle is based on converting the
shift and subtract into a shift and add operation. In
simpler terms, the division operation is converted into a
multiplication operation.

This operation is called the convergence method of division
and it is used in the IBM 360/370 and CDC 6600/7600. The
method is described briefly below. We want to compute the
ratio (quotient) Q = N/D, where N is the numerator and D is
the denominator. This process is carried out in normalized
binary arithmetic form. Hence (0.5 <= N < 1) and (0.5 <= D
< 1). In the original method N is always less than D, but
this has been modified to accommodate the case N > D as
well. The only restriction placed on this method is that
both N and D must be normalized before any of the operations
can begin.
Let Ri for i = 1, 2, 3, ..., k be the successive converging
factors. One can select

    Ri = 1 + δ^(2^(i-1))        for i = 1, 2, ..., k        (3.3)

where δ = 1 - D and 0 < δ <= 0.5.

To evaluate the quotient Q we multiply both N and D by Ri,
starting from i = 1 until a certain stage, say k.
Mathematically, we have:

    Q = N/D = (N x R1 x R2 x ... x Rk) / (D x R1 x R2 x ... x Rk)

The value of the denominator D is substituted with 1 - δ and
the resulting equation is shown below:

    Q = (N x R1 x R2 x ... x Rk) / ((1-δ) x R1 x R2 x ... x Rk)

Expanding Ri in terms of (1 + δ^(2^(i-1))) as in equation
(3.3) for i = 1, 2, 3, ..., k, the above equation is
modified as given below:

    Q = (N x (1+δ) x (1+δ^2) x (1+δ^4) x ... x (1+δ^(2^(k-1))))
        / ((1-δ) x (1+δ) x (1+δ^2) x ... x (1+δ^(2^(k-1))))

The denominator can be reduced to one term as shown in the
following:

    Q = (N x (1+δ) x (1+δ^2) x (1+δ^4) x ... x (1+δ^(2^(k-1))))
        / (1 - δ^(2^k))

The value of δ cannot exceed 0.5. Hence, the denominator
term (1 - δ^(2^k)) will tend towards unity when the value of
k is sufficiently large. For an eight bit machine, an
accuracy of 0.996 can be achieved within three iterations
(k = 3). Thus the equation is approximated as follows:

    Q ≈ N x (1+δ) x (1+δ^2) x ... x (1+δ^(2^(k-1)))

A table is given below tabulating the convergence sequence
for the maximum value of δ = 0.5.

    Iteration # (k)    δ^(2^k)          1 - δ^(2^k)
    1                  0.25             0.75
    2                  0.0625           0.9375
    3                  0.003906         0.996094
    4                  1.526 x 10^-5    0.9999847
    5                  2.32 x 10^-10    0.9999999
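The iteration is short to express in software. The sketch below assumes a normalized divisor (0.5 <= D < 1) and a caller-chosen iteration count; `converge_divide` is my own name for it, not part of the design.

```python
# Convergence division: multiply the running numerator by (1 + delta^(2^k))
# each iteration, and square delta to obtain the next power. The implicit
# denominator factor (1 - delta^(2^k)) tends to 1, so the result tends to N/D.

def converge_divide(n, d, iterations=5):
    delta = 1.0 - d   # 0 < delta <= 0.5 for normalized d
    q = n
    for _ in range(iterations):
        q *= 1.0 + delta
        delta *= delta
    return q

print(abs(converge_divide(0.75, 0.6) - 0.75 / 0.6) < 1e-6)  # -> True
```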

There will be an overflow if N > D, and this is taken care
of as follows. Both N and D are bounded by the limits 0.5
and 1, where 0.5 is the lower bound and 1 the upper bound;
N and D can assume the value 0.5 but cannot assume 1. Hence
if N > D then N/D can be represented as '1 + fraction',
wherein the fraction is less than 1.

Mathematically we have

    Q = N/D = (D + B)/D = 1 + (B/D)

where B = N - D. The operation of B/D is carried out by the
convergence method. The total result can be obtained by
initialising an overflow bit. This overflow bit must be
taken into consideration when the result of this operation
is required for further operations.
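The N > D adjustment amounts to peeling off an integer one before dividing; `split_overflow` below is a hypothetical helper illustrating it.

```python
# When N > D, write N/D = (D + B)/D = 1 + B/D with B = N - D. Since both
# operands lie in [0.5, 1), B < 0.5 <= D, so the fractional divide always
# sees a numerator smaller than its denominator; the 1 is the overflow bit.

def split_overflow(n, d):
    return (1, n - d) if n > d else (0, n)

bit, b = split_overflow(0.9, 0.6)
print(abs((bit + b / 0.6) - 0.9 / 0.6) < 1e-12)  # -> True
```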
3.3.5  MODIFICATIONS DUE TO DIVISION:

The implementation of division by the convergence method
using the Wallace tree is carried out by splitting the
process into iterations. Each iteration computes the new
partial product, and the number of iterations depends on the
convergence factor delta. The process is explained
mathematically below.

    Let P2 = P1 x (1 + δ^2)        (3.8)

    Let P3 = P2 x (1 + δ^4)        (3.9)

The value of P3 is then substituted in equation (3.6).

It is easy to see that the pipeline has to be modified to
achieve the division operation. From the discussion above,
the partial product Pk is calculated by placing P(k-1) and
(1 + δ^(2^(k-1))) as the two input arguments to the
pipeline. In the first iteration, the convergence factor is
easily calculated, but in the following iterations the power
of delta rises as 2^k. Hence the higher power of delta for
the next iteration is calculated by multiplying the present
value of delta with itself. The purpose of calculating delta
is twofold: 1) to provide the second argument for the next
iteration at the top of the pipe, and 2) to find out whether
convergence has been achieved. The calculation of the second
argument for the next iteration involves delta being
calculated twice at stage six in consecutive clock cycles.
The second delta is used for testing the convergence and
also to realize the next higher power of delta.
The process of calculating the value of delta twice in stage
six is achieved by placing a latch in stage 5 and holding
the value of delta for an extra clock cycle. Hence a new
latch is added to the pipeline at stage five. There is
another change that has to be carried out to ensure the
successful operation of division. The partial product that
comes out of stage six cannot be fed back into the pipe for
the next iteration because the second argument is not yet
available. Hence another stage is added to hold the value of
the partial product until the second argument becomes
available. The changes made are shown in Fig. 3.8. Thus the
pipe is converted from a unifunction pipe to a multifunction
pipe capable of dynamic behavior, as shown in Fig. 3.9. The
dynamic operation depends on both the hardware and the
control schemes for its successful operation. The control is
based on the hardware, and the details of how the control
was realized are explained in the next section.
3.4  DYNAMIC EXECUTION OF INSTRUCTIONS:

The dynamic scheduling of data in a multifunction


pipeline is essential for the successful operation of the
pipeline. The scheduling algorithm maintains collision free
execution of instructions in the pipeline system. The
development of such procedures has been studied by several
researchers. Ramamoorthy and Li [15] have shown that the
general problem is intrinsically difficult and is a member
of the NP complete class of problems. It is conjectured that
any such problem of this class has no fast solution, that
is, a scheduling algorithm is not a polynomial function of
the number of items to be scheduled. Ramamoorthy and Li
[16], Ramamoorthy and Kim [10], and Sze and Tou [17] have
studied suboptimal scheduling algorithms and their
characteristics with mixed results. The work performed by
Davidson [18], Shar [19], Patel and Davidson [20], and
Thomas and Davidson [21] is taken as the foundation, and the
scheduling of the dynamic pipe is developed from it.

Fig. 3.8.  Changes to the pipeline due to division.

Fig. 3.9  Multifunction eight bit arithmetic pipeline unit.
3.4.1  COLLISION VECTORS:

Collision vectors are binary vectors that are derived from
the latencies of a given pipeline. Latency is defined as the
number of pipeline cycles between two successive initiations
of instructions in the pipeline. Initiation is defined as
the process wherein an instruction is fed into the input
stage of the pipeline system. An initiation corresponds to
the start of the computation of a single function. The
latency is an integer and is bounded theoretically from 0 to
infinity. Static pipelines are forbidden to have a latency
of 0, as two simultaneous initiations cannot be performed.
The latency can also be
derived from reservation tables. The reservation table is
a two dimensional array representing the usage of the
pipeline stages by the function. It represents the flow of
the instruction from the first stage to the last stage.
Every function has its unique reservation table. The latency
of a given function is determined by using two reservation
tables of the same function. The first reservation table is
shifted one clock cycle to the left and placed on the second
reservation table. If there are any common stages between
the two tables, there is a chance of collision and that time
cycle is a forbidden latency. The shifting and overlaying
is carried out for all the pipeline cycles that the
instruction remains in the pipeline. A static pipeline is
capable of executing only a single function.
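The shift-and-overlay procedure can be sketched directly. In the sketch a reservation table is a list of sets, one set of busy clock cycles per stage; the sample table and function names are hypothetical, not the tables of this design.

```python
# Forbidden latencies are the cycle differences at which some stage of a
# shifted copy of the reservation table lands on a busy cycle of the
# original; the collision vector marks them (and latency 0) with 1s.

def forbidden_latencies(table):
    return sorted({t2 - t1 for row in table
                   for t1 in row for t2 in row if t2 > t1})

def collision_vector(table):
    span = max(t for row in table for t in row) + 1
    forb = set(forbidden_latencies(table))
    return [1 if (lat == 0 or lat in forb) else 0 for lat in range(span)]

sample = [{0, 3}, {1}, {2}]         # stage 0 busy at cycles 0 and 3, etc.
print(forbidden_latencies(sample))  # -> [3]
print(collision_vector(sample))     # -> [1, 0, 0, 1]
```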


The collision vectors are unique for any given function. The
cross collision vectors depict the times of all possible
initiations within a time period. They are used to schedule
the instruction flow in any pipeline. They are derived as
follows:

The length of the collision vector of a function is equal to
the difference between the time it is initiated and the time
the final result is obtained. If a function is present in
the pipeline for 10 pipeline cycles, the collision vector
would be 10 bits long.
The elements of the binary vector are assigned according to
the latency sequence. All available latencies are assigned
as 0s and all forbidden latencies are assigned as 1s. The
latency sequence is derived by initiating the function into
the pipeline and calculating all the available latencies for
initiating the same function. The available latencies are
bounded by the maximum time that the initiated function
remains in the pipeline. If the function remains in the
pipeline for 10 pipeline cycles, the latency sequence can
have at most 10 elements. If the latency sequence of a
pipeline is (0,1,4,6,8) out of a possible 10 clock cycles,
then the collision vector will be ( 0 0 1 1 0 1 0 1 0 1 ).

Collision vectors are used to derive the cross collision
matrices for the dynamic pipeline. The collision vector for
the reservation table in Fig. 3.10 is listed below:
    c1 = ( 1 1 1 1 0 0 1 0 )

Fig. 3.10  A sample reservation table

Based on the above concepts, the scheduling of the


pipeline is derived in the next sections.
3.4.2  DESIGN OF CROSS COLLISION MATRICES:

The scheduling of instructions in a dynamic system cannot be
performed by a single collision vector. This is due to the
fact that more than one function is executed in the pipeline
and a single vector cannot represent all the available
latencies of all the functions. The scheduling of data in a
dynamic pipeline is carried out by using a matrix instead of
a vector. The matrix is called the cross collision matrix.
The design of the cross collision matrix is based on four
collision vectors. These collision vectors belong to the
four arithmetic operations: addition, subtraction,
multiplication, and division. The collision vector for the
individual functions is derived on the basis of the
reservation table associated with each function. The
collision vectors for the four functions are derived below:
The operation of addition is very similar to that of
subtraction. The only difference is that in subtraction one
of the operands is inverted and the carry in is equal to 1.
All this can be achieved in one clock cycle and therefore no
additional stage is required. The reservation table for
addition and subtraction is shown in Fig. 3.11. The
collision vector for the function of addition and
subtraction is as follows:

    Cadd = { 1 0 0 0 0 0 0 0 }        (3.11)

    Csub = { 1 0 0 0 0 0 0 0 }        (3.12)

The multiplication operation is initiated at the top of the
pipeline. The function once initiated passes through the
various stages to the end of the pipeline. The reservation
table for the operation of multiplication is given in Fig.
3.12. From the reservation table it is found that the
latency is 1 and that the collision vector closely resembles
that of addition and subtraction. The collision vector for
multiplication is as follows:

    Cmlt = { 1 0 0 0 0 0 0 0 }        (3.13)

The collision vector for division involves two sets of
computation. The first set is the generation of partial
products and the second set is the generation of the new
convergence factor. This is due to the fact that, from
equation (3.5), we have in the convergence method two
products to be calculated for each iteration. Recalling the
convergence equation (3.3), we see the need to calculate the
new quotient and the convergence factor for each iteration.

The reservation table for this function is the combination
of equations (3.4) and (3.7). First, the partial quotient is
calculated by introducing one of the arguments, N, as one of
the inputs; the second input is (1 + δ). This gives the new
partial quotient, namely N x (1 + δ). This process is
illustrated in Fig. 3.13. To calculate δ^2 for the next
iteration, the value of δ is initiated as the two inputs

Fig. 3.11  Reservation table for addition and subtraction.

Fig. 3.12  Reservation table for multiplication.
which immediately follows the initiation of the previous
function. This is carried out to have all the operands
available for the next iteration without any undue delay.
The value of delta is held in stage five for two consecutive
clock cycles because of the necessity of obtaining the new
values of (1 + δ^(2^k)) and (δ^(2^k))^2. This is illustrated
in Fig. 3.14. In the reservation table the flow of the
partial


products is marked with an X and that of delta is marked
with an O. Even though they are two different subfunctions
of the same main function, they are combined together into
one reservation table. The reservation table for division is
shown in Fig. 3.15. The collision vector for division is
given below:

    Cdiv = { 1 1 1 0 0 0 0 0 }        (3.16)

DESIGN OF THE CROSS COLLISION MATRIX:

A cross collision matrix is an r x d binary matrix, where r
is the number of reservation tables and d is the maximum of
the clock cycles of all the tables. In our design the value
of r is 3 and the value of d is 8. The cross collision
matrix represents a state of operation of the pipeline. The
steps for designing the initial cross collision matrices are
as follows [23]:
Step 1:  There are r initial states for the r reservation
tables. The table i which assumes the first initiation at
clock cycle 0 is of the type i.

Step 2:  The jth row of the ith matrix CMi is the collision

Fig. 3.13  Reservation table for the partial product of N(1 + δ).

Fig. 3.14  Reservation table for delta products (δ^2).

Fig. 3.15  The reservation table for the convergence method.
vector between an initiation of type i and a later
initiation of type j. Thus CMi(j,k) is 0 only if shifting
the reservation table j, k places to the right and
overlaying it on a copy of the reservation table i results
in no collision. Here k denotes the number of clock cycles
from the initial clock cycle 0 at which an initiation of the
function j is desired.

Step 3:  In all cases the ith row of CMi is the same as the
initial collision vector of the function i. It is equivalent
to the reservation table i used in a static configuration.
The other rows are called cross collision vectors.
Fig. 3.16 shows a sample initial collision matrix for an
operation i. The number of rows depends on the total number
of distinct operations that can be performed by the pipeline
system. In our research the pipeline can perform three
distinct operations and hence the rows are three in number.
Let the sample matrix represent the initial collision matrix
of the divide operation, which is tagged as operation number
1. Row 1 is the initial collision vector of the division
operation. Row 2 is the cross collision vector between
operation 1 and operation 2. Row 3 is the cross collision
vector between operation 1 and operation 3.

The initial collision matrix for the division operation is
derived from the collision vectors of addition, subtraction,
multiplication and division. The formulation

Fig. 3.16  Structure of an initial cross collision matrix
for operation i. Row j (for j = 1 to n) is the cross
collision vector between operation j and operation i; row i
itself is the collision vector for operation i. The number
of rows equals the total number of operations n that can be
performed by the system, and the number of columns equals
the maximum compute time of operations 1 to n.
of the initial collision matrix for division is chosen to
illustrate the process. The operations are tagged as 1, 2,
3, and so on. Each operation has its own initial

collision matrix. In our research the operation of division
is tagged as 1, the operation of multiplication is tagged as
2, and the operations of addition and subtraction are tagged
as 3. The assignment of the tags is of no consequence and
they can be assigned as desired. Care should be taken to
assign the same numbers to the initial collision matrices
associated with each operation. In this case, the collision
matrix 2 represents the initial collision matrix for
multiplication. The row 1 of matrix 1 should be the
collision vector of operation 1, and the row 2 of matrix 2
should be the collision vector of operation 2. This is the
same case with all the initial cross collision matrices.
The row 1 of matrix 1 will be the collision vector of the
division operation. The elements of row 1 are
{ 1 1 1 0 0 0 0 0 }. The row 2 represents the cross
collision vector between multiplication and division. The
elements of this row are obtained by sliding the reservation
table of multiplication in Fig. 3.12 to the right over the
reservation table of division in Fig. 3.14 and determining
the available latency sequence. The available latency
sequence between division and multiplication is
{ 3, 4, 5, 6, 7 } and hence the elements of row 2 are
{ 1 1 1 0 0 0 0 0 }.

Using the same process the cross collision vector between
addition and division is obtained, and the elements in row 3
are { 0 0 0 0 0 1 1 1 }. The resulting matrix is given in
Fig. 3.17.
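The sliding-and-overlay derivation of a cross collision vector can be sketched as below; tables are dicts mapping stage number to the set of busy cycles, and both sample tables are hypothetical, not the tables of this design.

```python
# Entry k of the cross collision vector is 1 iff starting function j,
# k cycles after function i, makes some stage busy in the same cycle for
# both: slide table j right by k and test for overlap with table i.

def cross_collision_vector(table_i, table_j, d):
    vec = []
    for k in range(d):
        clash = any(table_i.get(stage, set()) & {t + k for t in cycles}
                    for stage, cycles in table_j.items())
        vec.append(1 if clash else 0)
    return vec

ti = {2: {2, 3}}  # function i uses stage 2 at cycles 2 and 3
tj = {2: {0}}     # function j uses stage 2 at cycle 0
print(cross_collision_vector(ti, tj, 6))  # -> [0, 0, 1, 1, 0, 0]
```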
In the proposed pipeline system, there are three distinct
functions which produce three initial collision matrices,
and they are presented in Fig. 3.18. The state diagrams are
generated using the initial collision matrices.

3.4.3  GENERATION OF STATE DIAGRAMS:

The state diagrams represent the condition of stage
utilization of the pipeline at any instant of time. The
state diagram gives the controllers useful information about
the state of the system. At each pipeline cycle, the
pipeline configuration corresponds to one of the states. The
generation of state diagrams follows the steps given below:
Step 1:  Each initial cross collision matrix is a single
state. The initiations are controlled by the elements of
individual columns.
Step 2:  The next state is determined by looking at the
column 0 of the present collision matrix. For every 0 in the
column there can be an initiation. The function that can be
initiated depends on the row where the 0 occurs. If a 0 is
present in row i then an initiation for function i is
possible. This means that the new initiation of function i
will not collide with any of the previous initiations.

Fig. 3.17  Initial cross collision matrix for division.

Fig. 3.18  Initial cross collision matrices for the three
operations: division, multiplication, and addition or
subtraction.
However, this does not guarantee that it will not collide
with any other initiations that may be possible at the same
time. For each initiation there will be a new state. The new
state is determined by ORing the present collision matrix
with the initial collision matrix corresponding to the
function i.
Step 3:  The compatible initiation set is determined as
follows. The compatible initiation set is basically the set
of functions that can be started at the same time without
any collisions. This is equivalent to placing the associated
reservation tables one on top of the other, forming a
composite overlay, and ensuring that there are no matches.
Step 4:  For a single initiation, the generation of a new
state is as explained in step 2. If the column 0 contains
more than one 0, then multiple initiations of functions are
possible. When multiple initiations are required, the
functions are first checked to see whether they belong to a
compatible initiation set. If the functions are compatible,
the new state is generated by ORing the present state matrix
with the combined collision matrix. The combined collision
matrix is derived by ORing all the individual initial
collision matrices representing the functions that are to be
currently initiated.
Step 5:  If no initiation is possible in the present cycle,
the collision matrix is shifted one place to the left and
zeros are introduced from the right. The pipeline remains in
the present state. The column 1 now becomes the column 0 and
steps 1 to 5 are again followed.
Step 6:  All the new states from the present state matrix
have to be generated. Considering the column 0, steps 1 to 4
are carried out for all permissible initiations. If no
initiations are possible then step 5 is adopted. After all
the new states have been derived from column 0, the state
matrix is shifted one column to the left as in step 5. This
process is carried on until all the columns in the present
state matrix have been processed. At the end, the present
state matrix will be a zero matrix.
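Steps 1 to 6 reduce to two matrix operations, which the following sketch illustrates on tiny hypothetical 3x3 matrices (the actual design uses 3x8 matrices; all names here are mine).

```python
# A state is a binary matrix. An idle cycle shifts every row left with a
# zero entering on the right (step 5); an initiation ORs the present state
# with the initial matrix (or combined matrix) of the started functions.

def shift_left(state):
    return [row[1:] + [0] for row in state]

def initiate(state, initial_matrices, funcs):
    new = [row[:] for row in state]
    for f in funcs:
        for r, row in enumerate(initial_matrices[f]):
            for c, bit in enumerate(row):
                new[r][c] |= bit
    return new

CM_add = [[0, 0, 0], [0, 0, 0], [1, 0, 0]]  # hypothetical initial matrices
CM_mul = [[0, 0, 0], [1, 0, 0], [0, 0, 0]]
state = shift_left(CM_mul)                  # one idle cycle after a multiply
state = initiate(state, {"add": CM_add}, ["add"])
print(state)  # -> [[0, 0, 0], [0, 0, 0], [1, 0, 0]]
```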
Steps 1 to 6 are carried out for all the state matrices that
have been generated. This process is stopped when, for every
possible initiation from the other states, the resulting
state already exists. The states in the diagram are linked
to each other by arcs. These arcs are labelled; the
labelling represents the function initiated and the latency.
All the states have to be derived so as to enable the system
to move from one state to another after each initiation.
The compatible initiation set is defined as a set of
functions that can be initiated as a single function or as
multiple functions, and cause no collisions between
themselves. The compatible initiation sets for the pipeline
under consideration are as follows:

    I1 = { Addition }
    I2 = { Subtraction }
    I3 = { Multiplication }
    I4 = { Division }
    I5 = { Multiplication, Addition }
    I6 = { Multiplication, Subtraction }
    I7 = { Division, Addition }
    I8 = { Division, Subtraction }

The functions are tagged with the following number
representations:

    Addition and Subtraction => 1.
    Multiplication => 2.
    Division => 3.

In the state matrices, the row 1 corresponds to the division
operation, the row 2 corresponds to multiplication, and the
row 3 corresponds to addition or subtraction.

A 0 in row 1 implies that an initiation is possible for the
division operation. Similarly, a 0 in row 2 implies that an
initiation of multiplication is possible. If there are two
0s in a column, in row 1 and row 3, then both division and
addition (or subtraction) are possible at the same time.

Assuming the initial collision matrix for the division
operation as the current state, the new states are derived
using the steps described above. This example will show how
the various new states are developed. The state matrix is
presented below in Fig. 3.19:

Fig. 3.19  Initial state matrix, which is the initial cross
collision matrix for the division operation

The allowable latencies for each of the rows are listed
below in Fig. 3.20. The maximum table compute time is 8
clock cycles.

Fig. 3.20  The allowable latencies for the state matrix.

Looking at the columns, the compatible initiation set for
each column is shown in Fig. 3.21:

    Column 0: Initiation set = { 1 }
    Column 1: Initiation set = { 1 }
    Column 2: Initiation set = { 1 }
    Column 3: Initiation set = { 1, 2, 3, {1,2}, {1,3} }
    Column 4: Initiation set = { 1, 2, 3, {1,2}, {1,3} }
    Column 5: Initiation set = { 2 }
    Column 6: Initiation set = { 2 }
    Column 7: Initiation set = { 2 }

Fig. 3.21  The available initiation sets

For latency 0, the only allowable initiation is addition or subtraction. For latency 3, all the compatible operations are possible. The initiation illustrated in this example for that latency is { 1, 2 }. Hence this example covers both a single initiation and a multiple initiation. Listed in Fig. 3.22 are the initial collision matrices of addition and multiplication, respectively.
Latency 0: The state matrix is not shifted to the left, as the initiation occurs at latency 0. The new state is derived by ORing the present collision matrix with the initial collision matrix of addition. The resulting collision matrix is the new state after the initiation of addition. The operation and the result are listed in Fig. 3.23.

Fig. 3.22  Initial collision matrices for single initiation: the initial collision matrix for multiplication and the initial collision matrix for addition and subtraction

Fig. 3.23  Generation of the new state matrix for single initiation: the initial state matrix is ORed with the initial collision matrix for addition and subtraction to yield the new state matrix

Fig. 3.24  The new state matrix obtained by shifting left three times

Fig. 3.25  Process of deriving the combined initial cross collision matrix for dual initiation: the initial cross collision matrices for multiplication and for addition are combined

Fig. 3.26  The new state matrix for combined initiation of distinct functions: the initial state matrix is ORed with the combined cross collision matrix to yield the new state matrix
The process of generating the next state for the double initiation is no different from that of the single initiation. The present state matrix is shifted to the left by three columns for a latency of three, and zeros are introduced at the right. The shifted state matrix is shown in Fig. 3.24. The two initial collision matrices for addition and multiplication are ORed together to generate the combined initial collision matrix for the double initiation. The new state matrix is derived by ORing the current state matrix with the combined initial collision matrix. The operation and results are shown in Fig. 3.25 and Fig. 3.26.
The remaining new states are constructed in this
manner.
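The derivation procedure above — shift the current state matrix left by the chosen latency (zeros entering at the right), then OR in the initial collision matrix of the initiated function(s) — can be sketched in C, the language of the simulation program in Chapter 5. This is an illustrative model, not the thesis code; the 3x8 matrix shape follows the state matrices of this section.

```c
#define ROWS 3   /* row 1: division, row 2: multiplication, row 3: add/sub */
#define COLS 8   /* the maximum table compute time is 8 clock cycles       */

/* Derive the next state matrix in place: shift the current state left by
 * `latency` columns (zeros enter on the right), then OR in the initial
 * collision matrix of the function(s) being initiated. */
void next_state(int state[ROWS][COLS], int init[ROWS][COLS], int latency)
{
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++) {
            int shifted = (c + latency < COLS) ? state[r][c + latency] : 0;
            state[r][c] = shifted | init[r][c];
        }
}

/* A function may be initiated at a given latency only if its row of the
 * state matrix holds a 0 at that column. */
int latency_available(int state[ROWS][COLS], int row, int latency)
{
    return state[row][latency] == 0;
}
```

For a dual initiation, the `init` argument would be the combined initial collision matrix, i.e. the OR of the two functions' initial matrices, as in Fig. 3.25.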

It should be noted that if more than one initiation is possible in a column, then it is necessary to derive new states for each of them. All the possible states from the initial state matrix are illustrated in Fig. 3.27. The transitions between states are marked by arrows. Each arrow is labelled on the top by the latency and at the bottom by the initiation set.

Legend for Fig. 3.27 — 1: addition or subtraction operation; 2: multiplication operation; 3: division operation. A plain number on an arrow is the latency; [ ] encloses the operation initiated.

Fig. 3.27  The possible states from the initial collision matrix for division

CHAPTER FOUR

INSTRUCTION EXECUTION IN THE PIPELINE SYSTEM

The execution of instructions in the pipeline is governed by the issue unit and the controllers in the execution unit. The instructions are scheduled by resolving the RAW, WAW, and operational hazards. The general procedure for the execution of instructions in our proposed system is recapped below:
An instruction is fetched from memory during every pipeline cycle by the fetch unit. The fetch unit classifies the instruction and generates an effective address (EA) if needed. The EA is loaded into the appropriate counter. Two instructions are fetched from memory when both streams are active. The fetch unit feeds two queues in the decode unit. An individual stream is disabled once the corresponding queue is filled.
The decode unit decodes the instruction and places the information in the R field of the system status unit. The decode unit is disabled if a jump instruction is awaiting evaluation in the logic unit.
The function of the issue unit is to detect the hazards and resolve them according to the algorithm developed in Chapter 2. The issue unit is disabled if any jump instruction is being evaluated in the logic unit. The issue unit assigns a delay to each instruction and routes it to the appropriate execution unit.
The instructions that are to be delayed are stored in buffers provided at the input stages. The controller of the execution unit is responsible for resolving the operational hazards. The controller also checks the available latencies to accommodate instructions that are ready to be executed in the present pipeline cycle. The controller for the arithmetic unit provides feedback to the instruction status unit. This feedback takes the form of updating the counter fields, depending on the state of the system. The controllers are also responsible for loading the destination register with the result of an instruction as soon as it is out of the pipe. A sample set of instructions is listed below. The operation of the pipeline is illustrated by executing this sample set: the flow of instructions through the pipeline is displayed during each pipeline cycle until all the instructions have been executed.
Consider the following set of instructions:

load  r1, 20;
load  r2, 30;
load  r3, 40;
add   r4, r1, r2;
store k, r4;
add   r4, r2, r3;
mult  r5, r2, r3;
jnz   r5, 60;
....
store

The execution of each instruction is displayed in the following figures, and a brief explanation is provided for each cycle. The issue unit schedules an instruction one pipeline cycle later than the decode unit, and hence a column is provided in the instruction status unit that shows the current instruction being issued. The fields are captioned in the diagrams for easy identification.
Pipeline cycle # 0:

The instruction 'load r1, 20;' is fetched by the fetch unit. The instruction is not a jump instruction, and hence the counters of counter set 2 are left undisturbed. The current stream is still the PIC stream. This is shown in Fig. 4.1.
Pipeline cycle # 1:

The second instruction is fetched from memory. The second instruction, 'load r2, 30;', is also not a jump instruction. The first load instruction is forwarded to the PIC queue, where it is initially loaded into the bottom location; the bottom location is directly connected to the decoder unit. As a result, the first instruction is decoded and the decoded information is placed in the instruction status unit. Figs. 4.2 and 4.3 illustrate the presence of the instruction in stages 1 and 2.

Fig. 4.1  The state of the system at pipeline cycle # 0 ('load r1, 20;' in the fetch unit)

Fig. 4.2  The state of the system at pipeline cycle # 1 ('load r2, 30;' in the fetch unit)

Op-code: the opcode of the instruction.
Time: the time required to execute the instruction.
R field: the field of all registers in the system.
C field: the field of the counters that keep track of the registers.

Fig. 4.3  State of the instruction status unit during pipeline cycle # 1


Pipeline cycle # 2:
The first instruction is in the issue unit. The content of the counter c1, which represents the register r1, is zero. This implies that there is no RAW or WAW hazard. The Tinst-delay is calculated as follows:

Csink(old) = 0
Ttest = 0

From equation (2.14), Tinst-delay = 0. According to equation (2.17), Csink(new) = 6.
The issue unit issues instruction 1 to the logic unit without assigning any delay. Instruction 2 is in the decode unit and instruction 3 is in the fetch unit. Instruction 1 will load the register r1 with the new value after it has been executed by the logic unit. The total time that the load operation needs to execute is 6 pipeline cycles, and hence the counter c1 is set to 6. The current value of c1 denotes the number of pipeline cycles needed (with respect to the present pipeline cycle) for r1 to be loaded with 20. Figs. 4.4 and 4.5 illustrate the state of the system.
Pipeline cycle # 3:
The instruction 1 is in the logic unit. Instruction 2
is in the issue unit. Instruction 3 is in the decode unit
and instruction 4 is in the fetch unit. Instruction 4 is an
arithmetic instruction. The counters of the EAC stream are

Fig. 4.4  The state of the system at pipeline cycle # 2 ('load r3, 40;' in the fetch unit; no RAW or WAW hazard for instr # 1; instr # 1 issued for execution and routed to the logic unit)

Fig. 4.5  State of the instruction status unit during pipeline cycle # 2


still not operational. Instruction 2 is checked for hazards; it has none and is issued to the logic unit, with the counter c2 initialized to a value of 6. c1 is decremented, as it represents one less pipeline cycle for r1 to be loaded with the new value. The delay is calculated as in pipeline cycle # 2. These are illustrated in Figs. 4.6 and 4.7. The counter values in the instruction status unit are:

c1 = 5, c2 = 6
Pipeline cycle # 4:

This is similar to pipeline cycles 2 and 3. Instruction 3 is in the issue unit. The initial value of the counter c3 is 0, so no delay is needed. The instruction is issued to the logic unit. Figs. 4.8 and 4.9 illustrate the data flow for this cycle. The instruction in the fetch unit is not a branch instruction.
Pipeline cycle # 5:
The fourth instruction is in the issue unit. A RAW hazard is detected, as the previous instructions (# 1 and # 2) have not yet loaded the source registers with their new values. From the system status unit, c1 = 3 and c2 = 4. There is no WAW hazard, as c4 = 0. The delays are calculated as follows:

Csource-reg1 = 3
Csource-reg2 = 4
Csink(old) = 0

Fig. 4.6  The state of the system at pipeline cycle # 3 ('add r4, r1, r2;' in the fetch unit; no RAW or WAW hazard for instr # 2; instr # 2 issued for execution and routed to the logic unit)

Fig. 4.7  State of the instruction status unit during pipeline cycle # 3

Fig. 4.8  The state of the system at pipeline cycle # 4 ('store (k), r4;' in the fetch unit; no RAW or WAW hazard for instr # 3; instr # 3 issued for execution and routed to the logic unit)

Fig. 4.9  State of the instruction status unit during pipeline cycle # 4

From equation (2.20), Tsrc-delay = 5. From equation (2.18), Ttest = 5 + 3 - 1 = 7. From equation (2.29), since Ttest > Csink(old), the instruction delay is Tinst-delay = Tsrc-delay = 5, and the new sink counter value is Csink(new) = T + (Tsrc-delay - 1) = 3 + 4 = 7.

The instruction is issued to the DS in stage 6 of the arithmetic unit. The counter c4 is initialized to 7; the register r4 will be loaded with the new value after a period of 7 pipeline cycles. The instruction in the fetch unit is not a branch instruction. The process is illustrated in Figs. 4.10 and 4.11.
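The delay arithmetic of cycles # 2, # 5 and (below) # 7 can be collected into a single sketch. This is a reconstruction of one plausible reading of equations (2.14)-(2.30) from Chapter 2, which is not reproduced here; the type and function names are illustrative, but the numbers reproduce the worked cycles.

```c
#define NREGS 8   /* illustrative register-file size */

typedef struct {
    int src1, src2;   /* source register numbers       */
    int sink;         /* destination register number   */
    int exec_time;    /* T: cycles the operation needs */
} instr_t;

/* Sketch of the issue-unit delay computation. c[] holds one counter per
 * register: the cycles remaining until that register has its new value. */
int issue_delay(const instr_t *in, int c[NREGS])
{
    int cmax = c[in->src1] > c[in->src2] ? c[in->src1] : c[in->src2];
    int t_src_delay = (cmax > 0) ? cmax + 1 : 0;       /* cf. eq. (2.20) */
    int t_inst_delay = t_src_delay;                    /* RAW-only case  */

    /* WAW check: Ttest = Tsrc-delay + T - 1 (cf. eq. (2.18)); when the
     * old write to the sink finishes close behind the new one, an extra
     * cycle of delay is added (cf. eq. (2.27)). */
    int t_test = t_src_delay + in->exec_time - 1;
    if (c[in->sink] > 0 && t_test - c[in->sink] <= 2)
        t_inst_delay = t_src_delay + 1;

    /* Re-initialize the sink counter so it reaches zero exactly when the
     * new value lands in the register (cf. eqs. (2.17) and (2.30)). */
    c[in->sink] = (t_inst_delay > 0)
                ? in->exec_time + t_inst_delay - 1
                : in->exec_time;
    return t_inst_delay;
}
```

With c1 = 3, c2 = 4 and an add execution time of 3, this yields a delay of 5 and a sink counter of 7, matching cycle # 5 above.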
Pipeline cycle # 6:
The present instruction in the issue unit is the first store instruction. The execution of this instruction has to be delayed by 7 pipeline cycles; the delay is computed as shown in the previous cycle. The instruction will be held in the DS at the LU until the instruction delay counter counts down to zero. The state of the pipeline is shown in Figs. 4.12, 4.13, 4.14 and 4.15.
Pipeline cycle # 7:
The second add instruction is in the issue unit. The sink register is r4. The previous write to r4 has not yet completed; the counter c4 from the instruction status unit equals 5 pipeline cycles. This denotes that

Fig. 4.10  The state of the system at pipeline cycle # 5 ('add r4, r2, r3;' in the fetch unit; RAW hazard for instr # 4, no WAW hazard; instruction delay set equal to 5; instruction routed to the fixed point unit)

Fig. 4.11  State of the instruction status unit during pipeline cycle # 5

Fig. 4.12  The state of the system at pipeline cycle # 6 (RAW hazard for instr # 5, no WAW hazard; instruction routed to the logic unit)

Fig. 4.13  State of the instruction status unit during pipeline cycle # 6

Each unit is a delay buffer.

Pr #: priority number attached to each unit
ASR1: address of source register 1    ASR2: address of source register 2
DSR1: delay of source register 1      DSR2: delay of source register 2
SD1: source data 1                    SD2: source data 2
ID: instruction delay                 DR: destination resource

Fig. 4.14  State of the delay station in stage 1 of the AU during pipeline cycle # 6

Fig. 4.15  State of the delay station in stage 6 of the AU during pipeline cycle # 6


the previous instruction initializing r4 will not complete execution for another 5 cycles. The present instruction need not be delayed by that full amount; the instruction delay is computed as shown below.
Csource-reg1 = 2
Csource-reg2 = 3
Csink(old) = 5

From equation (2.20), Tsrc-delay = 4. From equation (2.18), Ttest = 3 + 4 - 1 = 6. Here Ttest > Csink(old) and (Ttest - Csink(old)) = 1, so using equation (2.27), Tinst-delay = (Tsrc-delay + 1) = 4 + 1 = 5. Using equation (2.30), Csink(new) = 7.

The new value will be loaded into the register after 7 cycles, and the counter c4 is re-initialized to 7. The instructions that use the previous value of r4 as a source operand are all referenced to the time when that old value (the result of the previous add instruction) is loaded into r4. Thus, as soon as the old value is loaded into r4, only the buffers that need it will capture it. Once the data is captured, a buffer will not reload until it is reset.
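This capture-once behavior of a delay-buffer operand slot can be sketched as follows. The field names mirror the ASR/DSR/SD captions of Fig. 4.14; the function itself is an illustrative model, not the simulator's code.

```c
typedef struct {
    int asr;        /* ASR: address of the watched source register     */
    int dsr;        /* DSR: cycles until the needed value is broadcast */
    int sd;         /* SD: the captured source data                    */
    int captured;   /* once set, the slot holds its value until reset  */
} operand_slot;

/* Called once per pipeline cycle with the register currently being
 * written back (-1 if none) and its value. The slot latches the
 * broadcast exactly once, when its delay counter has reached zero. */
void slot_tick(operand_slot *s, int written_reg, int written_val)
{
    if (s->captured)
        return;                      /* do not reload until reset */
    if (s->dsr > 0)
        s->dsr--;                    /* still counting down       */
    if (s->dsr == 0 && written_reg == s->asr) {
        s->sd = written_val;         /* capture the broadcast     */
        s->captured = 1;
    }
}
```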


The instruction in the fetch unit is a jump instruction based on the result of r5. If the branch is taken, the destination is the instruction with the label # 60. The destination address is loaded into counter 0 of the

Fig. 4.16  The state of the system at pipeline cycle # 7 ('jnz r5, 60;' in the fetch unit; RAW and WAW hazards for instr # 6; instruction delay set equal to 5; instruction routed to the fixed point unit)

Fig. 4.17  State of the instruction status unit during pipeline cycle # 7

Fig. 4.18  State of the delay station in stage 1 of the AU during pipeline cycle # 7

Fig. 4.19  State of the delay station in stage 6 of the AU during pipeline cycle # 7

Fig. 4.20  State of the delay station in the LU during pipeline cycle # 7


EAC stream. The EAC stream will become operational in the following pipeline cycle. This is illustrated in Figs. 4.16, 4.17, 4.18, 4.19 and 4.20.
Pipeline cycle # 8:

The multiplication instruction is issued to the DS of stage 1 in the AU. A delay of 3 cycles is necessary to resolve the RAW hazard; the delay is calculated as shown in pipeline cycle # 5. The counter c5 is initialized with the value 10. The EAC stream fetches the instructions starting from the label 60, along with the PIC stream. Both fetched instructions are assumed to be non-branch instructions. The jump instruction is in the decode stage. The register r1 is loaded with the value 20. The state of the pipeline is illustrated in Figs. 4.21 to 4.25.
Pipeline cycle # 9:

The jump instruction is in the issue unit. From the instruction status unit, the value of r5 will be available only after 11 cycles; hence the jump instruction can be evaluated only after 11 cycles. The instruction is issued to the DS of the LU with a delay of 12 cycles. The issue unit, along with the decode unit, will be disabled for the next 11 cycles, starting from the next cycle. The instructions present at the bottom of both queues are decoded and the information is placed in the instruction status unit. The register r2 is updated with the value 30. This is illustrated in Figs. 4.26 to 4.30.

Fig. 4.21  The state of the system at pipeline cycle # 8 (RAW hazard for instr # 7, no WAW hazard; instruction delay set equal to 3; instruction routed to the fixed point unit)

Fig. 4.22  State of the instruction status unit during pipeline cycle # 8

Fig. 4.23  State of the delay station in stage 1 of the AU during pipeline cycle # 8

Fig. 4.24  State of the delay station in stage 6 of the AU during pipeline cycle # 8

Fig. 4.25  State of the delay station in the LU during pipeline cycle # 8

Fig. 4.26  The state of the system at pipeline cycle # 9 (RAW hazard for instr # 8; instruction routed to the logic unit)

Fig. 4.27  State of the instruction status unit during pipeline cycle # 9

Fig. 4.28  State of the delay station in stage 1 of the AU during pipeline cycle # 9

Fig. 4.29  State of the delay station in stage 6 of the AU during pipeline cycle # 9

Fig. 4.30  State of the delay station in the LU during pipeline cycle # 9


Pipeline cycle # 10:

The first add instruction is due for execution. The initial cross collision matrix was a null matrix until this cycle. The first instruction to be initiated is the add instruction, so the initial cross collision matrix for addition now becomes the initial state matrix of the pipeline system. The initial state matrix is shown in Fig. 4.31. The add instruction is initiated, as the latency is available. The register r3 is updated with the value 40. The state of the system is illustrated in Figs. 4.32 to 4.35.
Pipeline cycle # 11:
The multiplication instruction is due for execution. The state matrix of the previous cycle is shifted one column to the left. The latency for multiplication is checked by examining row 2 at column 0; the latency is available. The instruction is initiated and the new state matrix is shown in Fig. 4.36. The data flow in the system is illustrated in Figs. 4.37 to 4.40.
Pipeline cycle # 12:
The second add instruction is due for execution. The state matrix of the previous cycle is shifted one column to the left, as illustrated in Fig. 4.41. The latency for the add is available, as the element at row 3, column 0 contains a 0. The instruction is initiated and the new state matrix is obtained as shown in Fig. 4.42. The data flow is illustrated in Figs. 4.43 to 4.46.
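The per-cycle controller behavior of cycles # 10 through # 12 — shift the state matrix one column to the left each cycle, test column 0 of the function's row, and OR in the function's initial collision matrix on a successful initiation — can be sketched as follows (an illustrative C model using the 3x8 matrices of Chapter 3, not the thesis code):

```c
#define ROWS 3
#define COLS 8

/* Each pipeline cycle the state matrix shifts one column to the left;
 * zeros enter on the right. */
void shift_left(int m[ROWS][COLS])
{
    for (int r = 0; r < ROWS; r++) {
        for (int c = 0; c + 1 < COLS; c++)
            m[r][c] = m[r][c + 1];
        m[r][COLS - 1] = 0;
    }
}

/* Try to initiate the function owning `row`: allowed only when column 0
 * of that row is 0, in which case the function's initial collision
 * matrix is ORed into the state. Returns 1 on success, 0 if the
 * instruction must be held for another cycle. */
int try_initiate(int state[ROWS][COLS], int init[ROWS][COLS], int row)
{
    if (state[row][0] != 0)
        return 0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            state[r][c] |= init[r][c];
    return 1;
}
```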

Fig. 4.31  Initial state matrix in cycle # 10

Fig. 4.32  The state of the system at pipeline cycle # 10 (instruction held in the issue unit; decode and issue units disabled)

Fig. 4.33  State of the delay station in stage 1 of the AU during pipeline cycle # 10

Fig. 4.34  State of the delay station in stage 6 of the AU during pipeline cycle # 10

Fig. 4.35  State of the delay station in the LU during pipeline cycle # 10

Fig. 4.36  The new state matrix in cycle # 11

Fig. 4.37  The state of the system at pipeline cycle # 11 (instruction held in the issue unit; decode and issue units disabled)

Fig. 4.38  State of the delay station in stage 1 of the AU during pipeline cycle # 11

Fig. 4.39  State of the delay station in stage 6 of the AU during pipeline cycle # 11

Fig. 4.40  State of the delay station in the LU during pipeline cycle # 11

Fig. 4.41  The shifted state matrix of cycle # 11

Fig. 4.42  The new state matrix of cycle # 12

Fig. 4.43  The state of the system at pipeline cycle # 12 (instruction held in the issue unit; decode and issue units disabled)

Fig. 4.44  State of the delay station in stage 1 of the AU during pipeline cycle # 12
Fig. 4.45  State of the delay station in stage 6 of the AU during pipeline cycle # 12

Fig. 4.46  State of the delay station in the LU during pipeline cycle # 12

Pipeline cycles # 13 to # 20:

The results are updated as they become available, and the branch instruction is evaluated during pipeline cycle # 19. These are illustrated in Figs. 4.47 to 4.54.

Fig. 4.47  The state of the system at pipeline cycle # 13 (instruction held in the issue unit; decode and issue units disabled)

Fig. 4.48  State of the delay station in the LU during pipeline cycle # 13

Fig. 4.49  The state of the system at pipeline cycle # 14 (instruction held in the issue unit; decode and issue units disabled)

Fig. 4.50  State of the delay station in the LU during pipeline cycle # 14

Fig. 4.51  The state of the system at pipeline cycle # 19 (instruction held in the issue unit; decode and issue units disabled)

Fig. 4.52  State of the delay station in the LU during pipeline cycle # 19 (unit 1 holds the jump instruction: ASR1 = R5, DSR1 = 0, SD1 = 1200)

[Fig. 4.53 The state of the system at pipeline cycle # 20: no RAW hazard and no WAW hazard for instruction # 60, no delay is assigned, and the instruction is routed to the logic unit]

[Fig. 4.54 State of the instruction status unit during pipeline cycle # 20. Op-code is the opcode of the instruction; Time is the time required to execute the instruction; the R field holds all registers in the system; the C field holds the counters that keep track of the registers]

CHAPTER FIVE
COMPUTER SIMULATION AND EXPERIMENTAL RESULTS

The operation of the system is simulated on the DEC VAX
11/750 mini-computer. The simulation program is implemented
in two sections. The first section simulates the PIU and the
second simulates the PEU. In real time operation, the
various units are synchronized. The units are termed stages.
The total number of stages in the system is ten. The first
three stages are the fetch unit, decode unit and issue unit
respectively. The remaining seven stages constitute the
pipelined arithmetic unit. The program is written in the C
language. Each stage is simulated by a single function. This
is illustrated in Fig. 5.1.
In actual operation, the stages operate concurrently.
For example, assume that the decode unit receives an
instruction I from the fetch unit at the beginning of
cycle J. It processes the instruction and forwards it to the
issue unit at the end of the cycle. The issue unit,
meanwhile, receives instruction I-1 at the beginning of
cycle J. Instruction I will be received by the issue unit
only at the beginning of cycle J+1. Fig. 5.2 illustrates
the data flow. The stages begin processing at the beginning
of each cycle and complete processing at the end of each
cycle. This implies that the simulation program must begin
the execution of functions at the same time. The

[Fig. 5.1 Structure of the simulation program: Function fetch_unit -> Function decode_unit -> Function issue_unit -> Function stage_one -> Function stage_two -> Function stage_three -> Function stage_four -> Function stage_five -> Function stage_six -> Function stage_seven]

[Fig 5.2 Actual data flow in real time for a pipeline system]


executions must also end at the same time. This is not
possible on a serial machine. The program is executed in
iterations, where each iteration represents one pipeline
cycle. In each iteration, the functions emulating the stages
are executed serially in their physical order. Furthermore,
the input for each function is provided by the preceding
function. The parallelism of executing the functions would
be lost if the output of any function were fed directly to
the input of the next function; this would reduce the
pipeline to a large sequential system. Concurrency is
introduced into the serial program by separating the
processing from the data transfer. Each function reads its
input from a buffer called the input buffer. The result
of the function is loaded into an output buffer. The next
function reads its data from its own input buffer, not
from the output buffer of the previous function. For
example, the function emulating the decode unit reads the
input from the input buffer assigned to this unit and
processes the instruction. The processed instruction is
stored in an output buffer designated to the decode unit.
The next function, emulating the issue unit, reads the
instruction from its input buffer, not from the output
buffer of the decode unit. This isolates the data between
two adjacent functions. The data transfer is carried out
before the beginning of the next iteration. This is shown
in Fig. 5.3. The whole program is executed

[Fig. 5.3 Emulating the concurrency operation in a serial program: in operating mode each function reads its input buffer and writes its output buffer; in data transfer mode only the buffer-to-buffer copies take place and no function is executed]

effectively in two modes: the operating mode and the data
transfer mode. Thus parallelism is obtained by serially
executing the functions and controlling the data transfer.
The program is subdivided into two groups of functions.
The first group is called the T_on group; it represents the
serial execution of the various emulating functions. The
second group is termed the T_off group; it represents the
data transfer functions. The various functions are described
in the following sections.
5.1 FUNCTIONS EMULATING THE STAGES OF THE PIU:

The PIU is emulated by three functions: 1) fetch_unit,
2) decode_unit, and 3) issue_unit. The memory and the input
and output buffers for each function are held in structures.


A structure is a collection of one or more variables,
possibly of different types, grouped together under a single
name for convenient handling. The structures of the buffers
and the memory are of the same type, shown below:
struct input_inst
{
    int opcode_field;
    int source_operand1;
    int source_operand2;
    int dest_operand;
    int valid;
};
The various functions are described below.


Function "fetch_unit":
This function emulates the operation of the fetch unit.
The function is provided with two sets of counters. These


counters are used to keep track of the branch instructions.


The counters are stored as structures and are individually
indexed from 0 to 9. The counter indexed as 0 is the program
counter. The instructions are classified by this function.
If a branch instruction is encountered, the effective
address held in the destination operand is loaded into the
appropriate counter. The instruction is classified using the
opcode. The function returns with the processed instruction
loaded into the output buffer. The function is provided with
two output buffers, one for each stream.
Function "decode_unit":

This function emulates the decode unit. It contains two
FIFO queues, represented as structures, and two decoding
routines; the decode unit has a queue and a decoder routine
for each stream. The data is read from the input buffer and
loaded into the appropriate queue.

Pointers are associated with each queue. The pointer
top_stack indicates the next free location. The pointer
bottom_stack points to the next instruction to be processed.
When the queue is empty, top_stack is equal to bottom_stack.
The instruction travels through the queue before it is
decoded. The decoded instruction is
stored in the output buffer. The information that is made
available in the decoding process is stored in a structure
which represents the instruction status unit. This
information is utilized by the issue unit to schedule the

execution of the instruction.


Function "issue_unit":

The function issue_unit emulates the operation of the
issue unit. The function receives the instruction from the
input buffer and detects hazards. The hazards are detected
by examining the counter values assigned to each register.
These values are available in the structure that represents
the instruction status unit. The hazards are resolved using
the equations derived in Chapter 2, and the instruction is
issued to the execution unit. The function returns with the
issued instruction placed in the output buffer.
5.2 FUNCTIONS EMULATING THE STAGES OF THE PEU:

The execution unit is the pipelined arithmetic unit.
The logic unit and the floating point unit are treated as
black boxes. The functions emulating the stages of the
execution unit are named after their respective stages.
For example, the function emulating stage one of the PEU
is called stage_one.
Function "stage_one":

The function stage_one emulates the operation of a
shifted multiplicand generator. This function generates the
initial partial products, which are summed to derive the
final product. The shifted multiplicand generator is an AND
gate array. The generated vectors follow these equations:

W(j+1) = a0*bj  a1*bj  a2*bj  a3*bj  a4*bj  a5*bj  a6*bj  a7*bj      ( j = 0, 1, ..., 7 )

so that W1 is generated for j = 0, W2 for j = 1, and so on up to
W8 for j = 7. The elements ai and bj belong to the input binary
vectors A and B.
Functions "stage_two" to "stage_five":

Stages two to five consist of the CSA elements. These
functions simulate the operation of the CSA elements. The
carry save adder is represented by the equations of the sum
and the carry vectors. These functions return with the
result in the output buffers, which are represented as
structures.
Function "stage_six":

The carry look ahead adder is also reproduced with the
aid of structures and fields. The partial sum vector and the
partial carry vector from the function stage_five are the
inputs to the present function. This function first generates
the carry elements using the inputs, and then the addition
takes place in the full adder using these carry elements.
The function can also receive its inputs externally; these
inputs are not related to the output of stage five. The add


instruction is introduced to the adder through the above


inputs.
The functions stage_one to stage_six are combined into
a single function called "pipeline".
5.3 CONTROL OF THE PIPELINE:

The pipeline activity has to be controlled to avoid
collisions of the data fed into the pipeline. The control of
the pipeline can be broadly classified into the following
control functions: 1) load_pipeline, 2) output_check,
3) set_logg, and 4) shift_trac. The function load_pipeline
mainly deals with the initiation of instructions into the
pipeline. The function output_check loads the results of
instructions into their destinations. The function
shift_trac is used to monitor the flow of instructions in
the arithmetic unit. The function set_logg monitors the
activity of the function shift_trac.
The following structures are used by the functions to
control the operation of the pipeline: 1) struct reg_stages,
2) struct iter_storage, 3) struct add_trac, 4) struct
mult_trac, 5) struct div_trac, 6) struct logg_sheet, and
7) struct multipurpose_registers.
The names of these structures state their respective
operations. The structures reg_stages are the output
buffers. The structures with names ending in '_trac' are
used to track the instructions through the pipeline, and
each operation has its own tracking registers. The structure
logg_sheet is used


to monitor the '_trac' structures. Each operation has its
own logg_sheet. One of the structures of the
multipurpose_registers is used as the control status
register. This register is used to pass control information
between the control functions.
Function "load_pipeline":

The input control deals with loading the arithmetic
unit with the instructions from the structure representing
the delay station (struct delay_station). The function scans
the structure delay_station for instructions that need to be
initiated into the pipeline during each iteration. The
instructions that need to be executed are checked against
the available latency. The latency information is available
from the state matrix. If the number of instructions to be
initiated is one and the latency is available, the
instruction is initiated into the pipeline unit. The token
for the destination register is loaded into the tracking
register, which tracks this instruction through the stages.
When more than one instruction contends for the same stage
and latency, the instruction with the higher priority is
initiated into the pipeline. The instructions with the
highest priority are those which are being iterated. If no
such instruction is present, then priority is given to the
instruction that has been in the structure delay_station for
the longest time. The instructions that have not been
initiated are assigned additional delays. The


counters in the structure instruction_status are updated
with this delay.
Function "output_check":

The output control can be divided into the following
operations: 1) non-divisional output control and 2)
division output control.
Non-divisional output control:
The non-division operations are addition, subtraction
and multiplication. The output control mainly deals with the
removal of the data from stage seven for the above mentioned
instructions. The information that the instruction has
reached stage seven is given by a tracking register assigned
for that particular instruction. For the multiplication
operation, the result is obtained when the tracking register
indicates stage seven. For addition and subtraction
operations, the result is obtained in the next cycle. The
result is loaded into the register specified by the tracking
register. The tracking register is initialized by the
function load-pipeline.
Division output control:
This deals with the operation of division only. When
the tracking register pertaining to this instruction
indicates stage 7, the outputs of stage six and stage
seven are taken from the pipeline and stored in the
structure priority-stack. The tracking register associated
with this instruction is made inactive. The tracking fields

are reset and the iteration counter is incremented. The


semi-processed instruction is given the highest priority and
will be initiated with the first available latency. The
number of iterations for the division instruction is fixed
at three. If the iteration counter is equal to 3 at the time
it is incremented, then the result is achieved and is
transferred to the appropriate register.
Function "shift_trac":

The function shift_trac is used to track the
instructions in the arithmetic unit. Each instruction that
is initiated is assigned a tracking register. The tracking
registers contain seven fields and each field represents a
stage. The tracking register will also contain the
destination register for the result of the instruction. When
an instruction is initiated at stage one, the tracking
register assigned to the instruction is initialized by
placing a token in field one. This function advances the
token to the next field denoting that the instruction has
moved to the next stage. When the token indicates that the
instruction is at the output stage, the function
output-check loads the result into the register specified
by the tracking register.
Function "set_logg":

This function keeps track of all the tracking registers
that are in use and updates the information about free
tracking registers.


Function "time_off":

The function time_off transfers the data from the
output buffer of one stage into the input buffer of the next
stage. This function maintains the data flow from one stage
to the other. The source code for emulating the PEU and PIU
is listed in appendix C.
5.4 COMPUTER GENERATION OF THE STATE DIAGRAMS:

The generation of the state diagram was implemented on
the VAX 11/750. The exact number of states cannot be
formulated as a general polynomial equation. However, we can
calculate the maximum number of states that are possible by
choosing the number of rows and columns. Starting from the
three initial collision matrices (three initial states), the
various state matrices are generated. The program consists
of various functions that are used to generate the state
matrices.

The functions are aided by two integer pointers

that monitor the generation of the state matrices. The state


matrices are held in structures. The integer pointers are
briefly described below. The functions are briefly described
below.
Pointer "index":

This is an integer pointer which is continuously updated
as new states are generated. The function of this pointer is
to determine whether all the possible states have been
derived.


Pointer "pres_num":

This is also an integer pointer, starting from state
one. It is incremented after all the possible states
formulated from the current state have been derived. When
its value is equal to the pointer "index", the program is
terminated.
Function "sleft_bits":

This function shifts the present matrix under
consideration left by the required number of positions,
given by the latency. Zeros are introduced from the right.
The resulting matrix is represented by a structure.
Function "or_cross":

In this function, the state matrix, which is stored as
a structure, is ORed with the required initial collision
matrix. If the initiation is a double initiation, then the
combined initial matrix is derived first and then ORed with
the current state matrix.
Function "name_it*":

This function is used to determine whether the new
state matrix is unique, that is, that no copy of it exists
among the state matrices that have been generated earlier.
This function generates the link list for the state matrix
under consideration in check_struc. The "*" in the name
indicates that this function is repeated for each compatible
initiation set.


Function "check_struc":

Function check_struc generates the new state matrices
from the current state. After each new state is created, it
is checked against the states that have been created
earlier. If a copy of this state is not present, then the
pointer index is incremented and the state matrix under
consideration is assigned a new state. Each new state matrix
is stored in a structure. The structure is provided with
fields which correspond to all possible initiations. These
fields are used to store the address of the next state for
that particular initiation. The pointer pres_num indicates
the current state to be investigated. From each state all
possible new states are derived. The function returns with
all the new states. The address fields are used as link
lists and contain the addresses of the states. The source
code for generating the state diagram is listed in Appendix
B. The state diagrams are presented in Appendix A.
5.5 EXPERIMENTAL RESULTS:

The simulation of the system was carried out on three
different instruction sets. The first instruction set
contained only the RAW hazard. The second instruction set
represented the RAW and WAW hazards. The third set
incorporated all three hazards. The instructions that the
program is capable of recognizing are shown in Fig. 5.4.
The format for each instruction is shown in Fig. 5.5. The
data is initially loaded as integers. It is converted to the
OPCODE   MNEMONIC   OPERATION
0        NOP        NO OPERATION
1        ADD        ADDITION
2        SUB        SUBTRACTION
3        MULT       MULTIPLICATION
4        DIVIDE     DIVISION
5        STORE      STORE (REGISTER -> MEMORY)
6        LOAD       LOAD (REGISTER <- MEMORY)
7        LOADI      LOAD (MEMORY <- DATA)
8        INC        INCREMENT
9        DEC        DECREMENT
10       AND        AND
11       OR         OR
12       NOT        NOT
13       BRANCH     UNCONDITIONAL BRANCH
14       BRANCHNZ   BRANCH IF NOT ZERO
15       BRANCHNC   BRANCH IF NO CARRY

Fig. 5.4 The instruction set adopted for simulation.

[Fig. 5.5 Instruction format adopted in the simulation program: the arithmetic instructions (ADD, SUB, MULT, DIVIDE) carry two source registers and a destination register; the data transfer instructions (LOAD, STORE) carry a memory location and a register; AND and OR carry two source registers and a destination register; INC, DEC and NOT carry a destination register; BRANCH, BRANCHNZ and BRANCHNC carry a destination address]

binary form at the entrance to the execution unit. This is


done to simplify the program.
The first instruction set is listed in Fig. 5.6. The
result was obtained after 19 iterations. Each iteration
represents a single pipeline cycle. The results are
tabulated as shown in Fig. 5.6, and the flow diagram is
shown in Fig. 5.7. The results of the second and third
instruction sets are illustrated in the same way: the
results of the second instruction set are shown in Figs. 5.8
and 5.9, and those of the third instruction set in Figs.
5.10 and 5.11. The timings coincide with the design values.
The program is capable of handling ten instructions at
a time. It is now being modified to run for larger sets
involving branch instructions. A sample set of instructions
listed in Smith [14] is being used to run the simulation.
The instruction set is the micro-code for a loop in Fortran.
The macro code is listed below:
      DO 10 I = 1,100
10    A(I) = B(I) + C(I)*D(I)

The micro code for the loop section of the macro code is
given below:

100: load r8, (C);
     load r9, (D);
     load r10, (B);
     load r11, (A);
     add r3, r8, r2;
     add r4, r9, r2;
     add r5, r10, r2;
     add r12, r11, r2;
     mult r6, r3, r4;
     add r7, r6, r5;
     store (r12), r7;
     dec r2;
     branchnz 100, r2;
The space time diagram for the static scheduling and
execution of the sample set for two loops is shown in Fig.
5.12. Fig. 5.13 illustrates the space time flow in the
proposed system. The flow in Figs. 5.12 and 5.13 represents
a hand simulation based on the proposed system. Speed up is
achieved due to the dynamic scheduling and execution.

Instruction set # 1:
    load r1, (X);
    load r2, (Y);
    add r3, r2, r1;
    store (Z), r3;

Location: (X) = 20, (Y) = 30

[Fig. 5.6 Program results for instruction set # 1: r2 holds 30 at iteration 10, r3 holds 50 at iteration 13, and (Z) holds 50 at iteration 19]

[Fig. 5.7 Space time flow of instruction set # 1: each instruction traced through the fetch (F), decode (D), issue (I) and execute (E) stages across the pipeline cycles]

Instruction set # 2:
    load r1, (X);
    load r2, (Y);
    add r3, r2, r1;
    store (Z), r3;
    load r3, (A);

Location: (X) = 20, (Y) = 30, (A) = 40

[Fig. 5.8 Program results for instruction set # 2: r2 holds 30 at iteration 10, r3 holds 50 at iteration 13 and 40 at iteration 18, and (Z) holds 50 at iteration 19]

[Fig. 5.9 Space time flow of the instruction set # 2]

Instruction set # 3:
    load r1, (X);
    load r2, (Y);
    add r3, r2, r1;
    add r4, r2, r1;
    store (Z), r3;
    load r3, (A);

Location: (X) = 20, (Y) = 30, (A) = 40

[Fig. 5.10 Program results for instruction set # 3]

[Fig. 5.11 Space time flow of the instruction set # 3]

[Fig. 5.12 Space time flow in case of static scheduling: the loop micro-code (four loads, four adds, mult, add, store, dec, branchnz) traced cycle by cycle for two loop iterations]

[Fig. 5.13 Space time flow in case of dynamic scheduling: the same loop micro-code traced cycle by cycle for two loop iterations]

CHAPTER SIX
CONCLUSIONS AND DISCUSSION

In this research, we have presented an algorithm for
dynamic instruction scheduling in a pipelined system.
Initially, the instructions fetched by the fetch unit are
classified. This classification is carried out to ascertain
the type of the instruction. If the instruction is found to
be a jump instruction, the second stream is made
operational. The streams are used to reduce the branch
overheads. The fetch unit keeps track of all the branch
instructions that have passed through it. This ensures that
the prefetching of instructions commences from the correct
location.
The instruction dependencies are resolved by using the
pointers associated with the sink registers. The scheduling
of the execution of the instructions is guaranteed hazard
free by the equations derived to resolve the hazards. The
buffers also aid the system by capturing the operands as
they become available. The missing operands are tagged with
the counter value. This eliminates the associative tag
comparisons proposed by Tomasulo [5] and Sohi and Vajapeyam
[6].

The execution unit also operates free of any hazard.
The state matrix ensures that the initiation is hazard free.
It also specifies the compatible initiations. The structure
of the arithmetic unit allows the execution of two
instructions simultaneously and hence increases the
throughput. The execution unit is also capable of
rescheduling a scheduled instruction. This makes the
system flexible. This flexibility is needed to ensure hazard
free operation of the system.
The dynamic execution of instructions in a pipelined
environment is hardly used in any of the high performance
computers today. The control of such systems is complicated,
which carries the potential for longer control paths and
longer clock periods. The idea is making a comeback in new
generation RISC processors, and the advancements in VLSI
technology are making it possible to realize such systems.
Interrupt handling and indirect addressing modes have
not been taken into consideration. Furthermore, the design
of the floating point unit and the logic unit has not been
discussed. These areas remain as our further research
effort.

REFERENCES

[1] Chen, T. C., "Unconventional super speed computer systems," in AFIPS 1971 Spring Jt. Computer Conf., AFIPS Press, Montvale, N.J., 1971, pp. 365-371.

[2] McIntyre, D., "An introduction to the ILLIAC IV computer," Datamation, April 1970, pp. 60-67.

[3] Evensen, A. J. and Troy, J. L., "Introduction to the architecture of a 288-element PEPE," in Proc. 1973 Sagamore Conf. on Parallel Processing, Springer-Verlag, N.Y., 1973, pp. 162-169.

[4] Rudolph, J. A., "A production implementation of an associative array processor - STARAN," in AFIPS 1972 Fall Jt. Computer Conf., AFIPS Press, Montvale, N.J., 1972, pp. 229-241.

[5] Tomasulo, R. M., "An efficient algorithm for exploiting multiple arithmetic units," IBM Journal of Research and Development, January 1967, pp. 25-33.

[6] Sohi, G. S. and Vajapeyam, S., "Instruction issue logic for high performance interruptable pipelined processors," ACM, June 1987, pp. 27-34.

[7] Keller, R. M., "Look ahead processors," Computing Surveys, Vol. 7, No. 4, December 1975, pp. 177-195.

[8] Dennis, J. B., "Modular, asynchronous control structures for a high performance processor," ACM Conf. Record, Project MAC Conf. on Concurrent Systems and Parallel Computation, June 1970, pp. 55-80.

[9] Tjaden, G. S. and Flynn, M. J., "Detection and parallel execution of independent instructions," IEEE Trans. Computers, Vol. C-19, No. 10, October 1970, pp. 889-895.

[10] Ramamoorthy, C. V. and Kim, K. H., "Pipelining - the generalized concept and sequencing strategies," Proc. NCC, 1974, pp. 289-297.

[11] Smith, J. E. and Weiss, S., "Instruction issue logic for super pipelined computers," IEEE Trans. Computers, Sept. 1984, pp. 110-118.

[12] Thornton, J. E., Design of a Computer - The Control Data 6600, Scott, Foresman and Co., Glenview, IL, 1970.

[13] Deverell, J., "Pipeline iterative arrays," IEEE Trans. Computers, Vol. C-23, No. 3, March 1975, pp. 317-322.

[14] Smith, J. E., "Dynamic instruction scheduling and the Astronautics ZS-1," Computer, July 1989, pp. 21-35.

[15] Ramamoorthy, C. V. and Li, H. F., "Pipelined architectures," Computing Surveys, Vol. 9, No. 1, March 1977, pp. 61-101.

[16] Ramamoorthy, C. V. and Li, H. F., "Efficiency in generalized pipeline networks," National Computer Conference, 1974, pp. 625-635.

[17] Sze, D. T. and Tou, J. T., "Efficient operation sequencing for pipeline machines," Proc. COMPCON, IEEE No. 72CH 0659-3C, 1972, pp. 265-268.

[18] Davidson, E. S., "Scheduling for pipelined processors," Proc. 7th Hawaii Conf. on System Sciences, 1974, pp. 58-60.

[19] Shar, L. E., "Design and scheduling of statically configured pipelines," Digital Systems Lab Report SU-SEL-72-042, Stanford University, Stanford, CA, September 1972.

[20] Patel, J. H. and Davidson, E. S., "Improving the throughput of a pipeline by insertion of delays," IEEE/ACM 3rd Ann. Symp. Computer Arch., IEEE No. 76CH 0143-5C, 1976, pp. 159-163.

[21] Thomas, A. T. and Davidson, E. S., "Scheduling of multiconfigurable pipelines," Proc. 12th Ann. Allerton Conf. Circuits and System Theory, Univ. of Illinois, Champaign-Urbana, 1974, pp. 658-669.

[22] Anderson, S. F., Earle, J. G., Goldschmidt, R. E., and Powers, D. M., "The IBM System/360 Model 91: Floating Point Execution Unit," IBM J. Res. Dev., January 1967, pp. 34-53.

[23] Hwang, K. and Briggs, F. A., Computer Architecture and Parallel Processing, McGraw-Hill Book Company, 1984.

APPENDIX

A.  Cross Collision Matrices

B.  Computer Program for Generating State Diagrams

C.  Simulation Program

APPENDIX A.

Cross Collision Matrices

APPENDIX B.

Computer Program for Generating State Diagrams

/*  generation of cross collision matrices  */

#include <stdio.h>
#include <math.h>
#define true 1
#define false 0

struct matrix
{
    int bits_row1[8];
    int bits_row2[8];
    int bits_row3[8];
};

struct direction
{
    int div_latency[8];
    int mult_latency[8];
    int add_latency[8];
    int div_add[8];
    int mult_add[8];
};

struct ident
{
    int name;
};

struct collision_matrix
{
    struct matrix smatrix;
    struct direction sdirection;
    struct ident sident;
};

struct collision_matrix binary_matrix[150];
struct collision_matrix for_present, upto_next, last_temp;
int index, pres_num;

/********************************************/
/*                                          */
/*        FUNCTION FOR INITIALIZING         */
/*                                          */
/********************************************/

struct collision_matrix init_cross(now)
struct collision_matrix now;
{
    static struct collision_matrix new = { { { 0 } } };

    now = new;
    return (now);
}

/********************************************/
/*                                          */
/*     Function for oring of matrices       */
/*                                          */
/********************************************/

struct collision_matrix or_cross(matrix_o, matrix_two)
struct collision_matrix matrix_o;
struct collision_matrix matrix_two;
{
    struct collision_matrix matrix_one;
    int j;

    matrix_one = init_cross(matrix_one);
    upto_next = init_cross(upto_next);
    for (j = 0; j < 8; ++j)
    {
        matrix_one.smatrix.bits_row1[j] = (matrix_o.smatrix.bits_row1[j] |
                                           matrix_two.smatrix.bits_row1[j]);
        matrix_one.smatrix.bits_row2[j] = (matrix_o.smatrix.bits_row2[j] |
                                           matrix_two.smatrix.bits_row2[j]);
        matrix_one.smatrix.bits_row3[j] = (matrix_o.smatrix.bits_row3[j] |
                                           matrix_two.smatrix.bits_row3[j]);
    }
    matrix_o = matrix_one;
    upto_next = matrix_one;   /* callers pick the result up here */
    return (matrix_one);
}

/********************************************/
/*                                          */
/*  FUNCTION FOR SHIFTING THE COLLISION     */
/*  BITS                                    */
/*                                          */
/********************************************/

struct collision_matrix sleft_bits(present, number)
struct collision_matrix present;
int number;
{
    struct collision_matrix use_once;
    int left;

    use_once = init_cross(use_once);
    for (left = 0; left < (8 - number); ++left)
    {
        use_once.smatrix.bits_row1[left] =
            present.smatrix.bits_row1[left + number];
        use_once.smatrix.bits_row2[left] =
            present.smatrix.bits_row2[left + number];
        use_once.smatrix.bits_row3[left] =
            present.smatrix.bits_row3[left + number];
    }
    present = use_once;
    return (present);
}

/********************************************/
/*                                          */
/*           FUNCTION NAME IT               */
/*     (records the division latency)       */
/*                                          */
/********************************************/

void dname_it(present, coming, ineex, pre_nu, repeat)
struct collision_matrix present, coming[];
int ineex, repeat, pre_nu;
{
    int i, j, find_sucess, number, consider;
    unsigned int flag;

    flag = 0;
    find_sucess = 0;
    number = index;
    consider = pres_num;
    for (i = 1; i <= number; ++i)
    {
        for (j = 0; j < 8; ++j)
        {
            if (((present.smatrix.bits_row1[j] ==
                  coming[i].smatrix.bits_row1[j]) &&
                 (present.smatrix.bits_row2[j] ==
                  coming[i].smatrix.bits_row2[j])) &&
                (present.smatrix.bits_row3[j] ==
                 coming[i].smatrix.bits_row3[j]))
            {
                flag = flag << 1;
                flag = flag | 1;
            }
            else
            {
                flag = flag << 1;
                flag = flag | 0;
            }
        }
        if (flag == 255)
        {
            binary_matrix[consider].sdirection.div_latency[repeat] = i;
            find_sucess = 1;
            break;
        }
        else
        {
            flag = 0;
        }
    }
    if (find_sucess != 1)
    {
        binary_matrix[number + 1] = present;
        binary_matrix[number + 1].sident.name = number + 1;
        binary_matrix[consider].sdirection.div_latency[repeat] =
            number + 1;
        index = number + 1;
    }
    return;
}

/********************************************/
/*                                          */
/*           FUNCTION NAME IT               */
/*   (records the multiplication latency)   */
/*                                          */
/********************************************/

void mname_it(present, coming, ineex, pre_nu, repeat)
struct collision_matrix present, coming[];
int ineex, repeat, pre_nu;
{
    int i, j, find_sucess, number, consider;
    unsigned int flag;

    flag = 0;
    find_sucess = 0;
    number = index;
    consider = pres_num;
    for (i = 1; i <= number; ++i)
    {
        for (j = 0; j < 8; ++j)
        {
            if (((present.smatrix.bits_row1[j] ==
                  coming[i].smatrix.bits_row1[j]) &&
                 (present.smatrix.bits_row2[j] ==
                  coming[i].smatrix.bits_row2[j])) &&
                (present.smatrix.bits_row3[j] ==
                 coming[i].smatrix.bits_row3[j]))
            {
                flag = flag << 1;
                flag = flag | 1;
            }
            else
            {
                flag = flag << 1;
                flag = flag | 0;
            }
        }
        if (flag == 255)
        {
            binary_matrix[consider].sdirection.mult_latency[repeat] = i;
            find_sucess = 1;
            break;
        }
        else
        {
            flag = 0;
        }
    }
    if (find_sucess != 1)
    {
        binary_matrix[number + 1] = present;
        binary_matrix[number + 1].sident.name = number + 1;
        binary_matrix[consider].sdirection.mult_latency[repeat] =
            number + 1;
        index = number + 1;
    }
    return;
}

/********************************************/
/*                                          */
/*           FUNCTION NAME IT               */
/*     (records the addition latency)       */
/*                                          */
/********************************************/

void aname_it(present, coming, ineex, pre_nu, repeat)
struct collision_matrix present, coming[];
int ineex, repeat, pre_nu;
{
    int i, j, find_sucess, number, consider;
    unsigned int flag;

    flag = 0;
    find_sucess = 0;
    number = index;
    consider = pres_num;
    for (i = 1; i <= number; ++i)
    {
        for (j = 0; j < 8; ++j)
        {
            if (((present.smatrix.bits_row1[j] ==
                  coming[i].smatrix.bits_row1[j]) &&
                 (present.smatrix.bits_row2[j] ==
                  coming[i].smatrix.bits_row2[j])) &&
                (present.smatrix.bits_row3[j] ==
                 coming[i].smatrix.bits_row3[j]))
            {
                flag = flag << 1;
                flag = flag | 1;
            }
            else
            {
                flag = flag << 1;
                flag = flag | 0;
            }
        }
        if (flag == 255)
        {
            binary_matrix[consider].sdirection.add_latency[repeat] = i;
            find_sucess = 1;
            break;
        }
        else
        {
            flag = 0;
        }
    }
    if (find_sucess != 1)
    {
        binary_matrix[number + 1] = present;
        binary_matrix[number + 1].sident.name = number + 1;
        binary_matrix[consider].sdirection.add_latency[repeat] =
            number + 1;
        index = number + 1;
    }
    return;
}

/********************************************/
/*                                          */
/*           FUNCTION NAME IT               */
/*  (records the divide-add cross latency)  */
/*                                          */
/********************************************/

void daname_it(present, coming, ineex, pre_nu, repeat)
struct collision_matrix present, coming[];
int ineex, repeat, pre_nu;
{
    int i, j, find_sucess, number, consider;
    unsigned int flag;

    flag = 0;
    find_sucess = 0;
    number = index;
    consider = pres_num;
    for (i = 1; i <= number; ++i)
    {
        for (j = 0; j < 8; ++j)
        {
            if (((present.smatrix.bits_row1[j] ==
                  coming[i].smatrix.bits_row1[j]) &&
                 (present.smatrix.bits_row2[j] ==
                  coming[i].smatrix.bits_row2[j])) &&
                (present.smatrix.bits_row3[j] ==
                 coming[i].smatrix.bits_row3[j]))
            {
                flag = flag << 1;
                flag = flag | 1;
            }
            else
            {
                flag = flag << 1;
                flag = flag | 0;
            }
        }
        if (flag == 255)
        {
            binary_matrix[consider].sdirection.div_add[repeat] = i;
            find_sucess = 1;
            break;
        }
        else
        {
            flag = 0;
        }
    }
    if (find_sucess != 1)
    {
        binary_matrix[number + 1] = present;
        binary_matrix[number + 1].sident.name = number + 1;
        binary_matrix[consider].sdirection.div_add[repeat] =
            number + 1;
        index = number + 1;
    }
    return;
}

/********************************************/
/*                                          */
/*           FUNCTION NAME IT               */
/* (records the multiply-add cross latency) */
/*                                          */
/********************************************/

void maname_it(present, coming, ineex, pre_nu, repeat)
struct collision_matrix present, coming[];
int ineex, repeat, pre_nu;
{
    int i, j, find_sucess, number, consider;
    unsigned int flag;

    flag = 0;
    find_sucess = 0;
    number = index;
    consider = pres_num;
    for (i = 1; i <= number; ++i)
    {
        for (j = 0; j < 8; ++j)
        {
            if (((present.smatrix.bits_row1[j] ==
                  coming[i].smatrix.bits_row1[j]) &&
                 (present.smatrix.bits_row2[j] ==
                  coming[i].smatrix.bits_row2[j])) &&
                (present.smatrix.bits_row3[j] ==
                 coming[i].smatrix.bits_row3[j]))
            {
                flag = flag << 1;
                flag = flag | 1;
            }
            else
            {
                flag = flag << 1;
                flag = flag | 0;
            }
        }
        if (flag == 255)
        {
            binary_matrix[consider].sdirection.mult_add[repeat] = i;
            find_sucess = 1;
            break;
        }
        else
        {
            flag = 0;
        }
    }
    if (find_sucess != 1)
    {
        binary_matrix[number + 1] = present;
        binary_matrix[number + 1].sident.name = number + 1;
        binary_matrix[consider].sdirection.mult_add[repeat] =
            number + 1;
        index = number + 1;
    }
    return;
}

/********************************************/
/*                                          */
/*   FUNCTION TO GENERATE AND CHECK THE     */
/*   STRUCTURES FOR NON REPETITION          */
/*                                          */
/********************************************/

void check_struc(put, inex, pr_nu)
struct collision_matrix put[];
int inex, pr_nu;
{
    struct collision_matrix temp_struct, sec_struc;
    int j, consider;

    consider = pres_num;
    temp_struct = init_cross(temp_struct);
    temp_struct = put[consider];
    sec_struc = temp_struct;

    /* generation of new structures */
    for (j = 0; j < 8; ++j)
    {
        if (temp_struct.smatrix.bits_row1[j] == 0)
        {
            sec_struc = sleft_bits(sec_struc, j);
            sec_struc = or_cross(sec_struc, put[1]);
            sec_struc = upto_next;
            dname_it(sec_struc, put, inex, consider, j);
            sec_struc = temp_struct;
        }
        if (temp_struct.smatrix.bits_row2[j] == 0)
        {
            sec_struc = sleft_bits(sec_struc, j);
            sec_struc = or_cross(sec_struc, put[2]);
            sec_struc = upto_next;
            mname_it(sec_struc, put, inex, consider, j);
            sec_struc = temp_struct;
        }
        if (temp_struct.smatrix.bits_row3[j] == 0)
        {
            sec_struc = sleft_bits(sec_struc, j);
            sec_struc = or_cross(sec_struc, put[3]);
            sec_struc = upto_next;
            aname_it(sec_struc, put, inex, consider, j);
            sec_struc = temp_struct;
        }
        if ((temp_struct.smatrix.bits_row1[j] == 0) &&
            (temp_struct.smatrix.bits_row3[j] == 0))
        {
            sec_struc = sleft_bits(sec_struc, j);
            sec_struc = or_cross(sec_struc, put[1]);
            sec_struc = upto_next;
            sec_struc = or_cross(sec_struc, put[3]);
            sec_struc = upto_next;
            daname_it(sec_struc, put, inex, consider, j);
            sec_struc = temp_struct;
        }
    }
    for (j = 0; j < 8; ++j)
    {
        if ((temp_struct.smatrix.bits_row2[j] == 0) &&
            (temp_struct.smatrix.bits_row3[j] == 0))
        {
            sec_struc = sleft_bits(sec_struc, j);
            sec_struc = or_cross(sec_struc, put[2]);
            sec_struc = upto_next;
            sec_struc = or_cross(sec_struc, put[3]);
            sec_struc = upto_next;
            maname_it(sec_struc, put, inex, consider, j);
            sec_struc = temp_struct;
        }
    }
    return;
}
1
main()
{
    int l, v;

    pres_num = 1;
    index = 3;
    binary_matrix[1] = init_cross(binary_matrix[1]);
    binary_matrix[2] = init_cross(binary_matrix[2]);
    binary_matrix[3] = init_cross(binary_matrix[3]);
    binary_matrix[1].smatrix.bits_row1[0] = 1;
    binary_matrix[1].smatrix.bits_row1[1] = 1;
    binary_matrix[1].smatrix.bits_row1[2] = 1;
    binary_matrix[1].smatrix.bits_row2[0] = 1;
    binary_matrix[1].smatrix.bits_row2[1] = 1;
    binary_matrix[1].smatrix.bits_row2[2] = 1;
    binary_matrix[1].smatrix.bits_row3[5] = 1;
    binary_matrix[1].smatrix.bits_row3[6] = 1;
    binary_matrix[1].smatrix.bits_row3[7] = 1;
    binary_matrix[2].smatrix.bits_row1[0] = 1;
    binary_matrix[2].smatrix.bits_row2[0] = 1;
    binary_matrix[2].smatrix.bits_row3[5] = 1;
    binary_matrix[3].smatrix.bits_row3[0] = 1;
    binary_matrix[1].sident.name = 1;
    binary_matrix[2].sident.name = 2;
    binary_matrix[3].sident.name = 3;

    while (pres_num <= index)
    {
        check_struc(binary_matrix, index, pres_num);
        pres_num = pres_num + 1;
    }

    printf("the various structures are tabulated below\n\n");
    for (v = 1; v <= index; v++)
    {
        for (l = 0; l < 8; ++l)
            printf("%d ", binary_matrix[v].smatrix.bits_row1[l]);
        printf("\n\n");
        for (l = 0; l < 8; ++l)
            printf("%d ", binary_matrix[v].smatrix.bits_row2[l]);
        printf("\n\n");
        for (l = 0; l < 8; ++l)
            printf("%d ", binary_matrix[v].smatrix.bits_row3[l]);
        printf("\n\n");
        for (l = 0; l < 8; ++l)
            printf("%d ", binary_matrix[v].sdirection.div_latency[l]);
        printf("\n\n");
        for (l = 0; l < 8; ++l)
            printf("%d ", binary_matrix[v].sdirection.mult_latency[l]);
        printf("\n\n");
        for (l = 0; l < 8; ++l)
            printf("%d ", binary_matrix[v].sdirection.add_latency[l]);
        printf("\n\n");
        for (l = 0; l < 8; ++l)
            printf("%d ", binary_matrix[v].sdirection.div_add[l]);
        printf("\n\n");
        for (l = 0; l < 8; ++l)
            printf("%d ", binary_matrix[v].sdirection.mult_add[l]);
        printf("\n\n");
        printf("%d\n\n", binary_matrix[v].sident.name);
    }
}

APPENDIX C.

Simulation Program

/************************************************/
/*****   SIMULATION OF DYNAMIC ARITHMETIC   *****/
/*****             PIPELINE                 *****/
/*****            VERSION 1.0               *****/
/************************************************/
/* In this program the eighth bit is stored in  */
/* position 0 and the first bit in position 8.  */
/************************************************/

#include <stdio.h>
#include <math.h>
#define true 1
#define false 0

/* the structure initializations for the instruction unit */

struct input_inst
{
    int opcode_field;
    int source_operand1;
    int source_operand2;
    int dest_operand;
    int valid;
};

struct instruction_status
{
    int inst_num;
    int pipe_cyl;
    int opcode;
    int exec_time;
    int reg_util[6];     /* indexed 1..5 by register number */
    int count_units[6];  /* indexed 1..5 by register number */
    int decode_ptr;
    int issue_ptr;
};

struct reg_file
{
    int reg_units[5];
};

struct status_reg
{
    int carry;
    int overflow;
    int sign;
    int zero;
};

struct address_counter
{
    int counter[20];
    int free_index;
};

struct issue_latch
{
    int opcode_fld;
    int dest_fld;
    int source1_fld;
    int src1data_fld;
    int src1delay_fld;
    int source2_fld;
    int src2data_fld;
    int src2delay_fld;
    int instdelay_fld;
};

struct dstack_status
{
    int queue_select;
    int full_queue;
    int flush_flag;
    int top_stack;
    int bottom_stack;
};

struct fetch_status
{
    int flush_flag;
    int address_flag;
    int picqueue_full;
    int eacqueue_full;
};

struct matrix
{
    int bits_row1[8];
    int bits_row2[8];
    int bits_row3[8];
};

struct direction
{
    int div_latency[8];
    int mult_latency[8];
    int add_latency[8];
    int div_add[8];
    int mult_add[8];
};

struct ident
{
    int name;
};

struct collision_matrix
{
    struct matrix smatrix;
    struct direction sdirection;
    struct ident sident;
};

struct recode
{
    int bits[15];
};

struct reg_stages
{
    int word[17];
};

struct div_track
{
    char name_one[3];
    int number;
    int st_track[10];
    int itr_track;
    int address;
};

struct mult_track
{
    char name_two[4];
    int number;
    int st_track[10];
    int address;
};

struct add_track
{
    char name_three[3];
    int number;
    int st_track[10];
    int address;
};

struct logg_sheet
{
    int logg[10];
    int logg_stat;
};

struct input_process
{
    int location;
    int func;
    int num_one[10];
    int num_two[10];
    int over_flow;
    int weight;
};

struct itr_storage
{
    int address;
    int func;
    int num_one[10];
    int num_two[10];
};

struct output_process
{
    int destination;
    int overflow;
    int result[17];
    int wt_factor;
};
typedef struct collision_matrix struct0;
typedef struct recode struct1;
typedef struct reg_stages struct2;
typedef struct div_track struct3;
typedef struct mult_track struct4;
typedef struct add_track struct5;
typedef struct logg_sheet struct6;
typedef struct input_process struct7;
typedef struct output_process struct8;
typedef struct itr_storage struct9;

typedef struct input_inst structi0;
typedef struct instruction_status structi1;
typedef struct reg_file structi2;
typedef struct status_reg structi3;
typedef struct address_counter structi4;
typedef struct issue_latch structi5;
typedef struct fetch_status structi6;
typedef struct dstack_status structi7;

structi0 memory[100], decode_stack1[20], decode_stack2[20];
structi0 *memory_ptr, *dstack1_ptr, *dstack2_ptr;
structi0 iunit_latches[20], internal_holders[20],
         *ilatch_ptr, *inhold_ptr;
structi1 status_unit[100], *statusu_ptr;
structi2 gp_register, *gp_ptr;  /* general purpose registers */
structi3 register_sr, *sr_ptr;
structi4 pgm_counter1, pgm_counter2, *pgm_ptr1, *pgm_ptr2;
structi5 isunit_latch, *isunit_ptr;
structi6 stream_status, *sstatus_ptr;
structi7 picqueue_status, eacqueue_status,
         *picstatus_ptr, *eacstatus_ptr;

int queue_select, current_queue, disable_decode, disable_issue;
struct collision_matrix binary_matrix[150];
struct collision_matrix for_present, upto_next, last_temp;
struct1 argument1[20], argument2[20], multipurpose_reg[20],
        *mpreg_ptr;
struct2 latches[30], par_product[10], transfer[30], delay[20];
struct3 div_follow[10], delta_track[10], *divflow_ptr,
        *deltaflow_ptr;
struct3 *copy_seven, *copy_eight;
struct4 mult_follow[10], *multflow_ptr, *copy_nine;
struct5 add_follow[10], *addflow_ptr, *copy_ten,
        sub_follow[10], *subflow_ptr;
struct6 div_logg, mult_logg, add_logg, process_logg[10],
        *prlogg_ptr;
struct7 input_stack[41], *instack_ptr, *copy_eleven;
struct8 output_stack[41], *outstack_ptr, *copy_twelve;
struct9 priority_stack[70], *prstack_ptr;
struct0 *bin_pointer;
struct1 *arg1_pointer, *arg2_pointer, *copy_one, *copy_two;
struct2 *par_pointer, *lat_pointer, *copy_four,
        *copy_three, *trans_pointer, *copy_five;
struct2 *delay_ptr, *copy_six;
int op_code[20], arg_one[20][9], arg_two[20][9];
int *ptr_op, *ptr_argmnt1[20], *ptr_argmnt2[20];
int index, pres_num, stk_ptr, total, multiplication, division;
int var1, var2, var3, var4, init_key, addition, subtraction,
    delta_flag;
int global_one[20], global_two[20],
    global_three[20], readjust;

/********************************************/
/*                                          */
/*   Functions of the instruction unit      */
/*                                          */
/********************************************/

/********************************************/
/*                                          */
/*        Instruction Fetch Unit            */
/*                                          */
/********************************************/

structi0 fetch_unit(ptr1, ptr2, ptr3, ptr4, ptr5, ptr6)
structi0 *ptr1, *ptr2;  /* pointers to the memory and the latches */
structi4 *ptr3, *ptr4;  /* pointers to the address counters */
structi6 *ptr5, *ptr6;  /* flags which denote the current queue in
                           session */
{
    int i, transfer_flag1, transfer_flag2;
    int program_counter1, program_counter2;

    transfer_flag1 = 0;
    transfer_flag2 = 0;
    (ptr2 + 1)->valid = 0;
    (ptr2 + 2)->valid = 0;

    /* to check and flush the redundant queues */
    if ((ptr5)->flush_flag == 1)
    {
        printf("flush flag is enabled\n");
        if ((ptr5)->address_flag == 1)
        {
            /* flush the PIC stream */
            printf("flush PIC stream\n");
            (ptr3)->counter[0] = 0;
            for (i = 1; i <= 9; i++)
            {
                (ptr4)->counter[i] = 0;
            }
            (ptr4)->free_index = 1;  /* setting the index flag of
                counter2 to 1 to indicate that the counters are
                flushed and the counter to be filled first is
                counter[1] */
        }
        if ((ptr5)->address_flag == 2)
        {
            /* flush the EAC stream */
            (ptr4)->counter[0] = 0;
            for (i = 1; i <= 9; i++)
            {
                (ptr3)->counter[i] = 0;
            }
            (ptr3)->free_index = 1;  /* setting the index flag of
                counter1 to 1 to indicate that the counters are
                flushed and the counter to be filled first is
                counter[1] */
        }
    }

    /* reading the memory for instructions; instructions will be
       fetched only if the program counter of the individual
       stream is non zero */
    program_counter1 = (ptr3)->counter[0];
    program_counter2 = (ptr4)->counter[0];

    /* fetching of instructions for the PIC stream */
    if ((program_counter1 != 0) && ((ptr5)->picqueue_full == 0))
    {
        (ptr2 + 1)->opcode_field =
            (ptr1 + program_counter1)->opcode_field;
        (ptr2 + 1)->source_operand1 =
            (ptr1 + program_counter1)->source_operand1;
        (ptr2 + 1)->source_operand2 =
            (ptr1 + program_counter1)->source_operand2;
        (ptr2 + 1)->dest_operand =
            (ptr1 + program_counter1)->dest_operand;
        transfer_flag1 = 1;  /* valid instruction; pass it to the
                                decode unit */
        (ptr3)->counter[0] += 1;
        (ptr2 + 1)->valid = 1;
    }

    /* fetching of instructions for the EAC stream */
    if ((program_counter2 != 0) && ((ptr5)->eacqueue_full == 0))
    {
        (ptr2 + 2)->opcode_field =
            (ptr1 + program_counter2)->opcode_field;
        (ptr2 + 2)->source_operand1 =
            (ptr1 + program_counter2)->source_operand1;
        (ptr2 + 2)->source_operand2 =
            (ptr1 + program_counter2)->source_operand2;
        (ptr2 + 2)->dest_operand =
            (ptr1 + program_counter2)->dest_operand;
        transfer_flag2 = 1;  /* valid instruction; pass it to the
                                decode unit */
        (ptr4)->counter[0] += 1;
        (ptr2 + 2)->valid = 1;
    }

    /* classifying the instruction:
       checking for jump instructions in the PIC stream */
    if (((ptr2 + 1)->opcode_field >= 13) && (transfer_flag1 != 0))
    {
        printf("there is a branch instruction detected in the PIC stream\n");
        (ptr4)->counter[(ptr4)->free_index] = (ptr2 + 1)->dest_operand;
        (ptr4)->free_index += 1;
    }

    /* checking for jump instructions in the EAC stream */
    if (((ptr2 + 2)->opcode_field >= 13) && (transfer_flag2 != 0))
    {
        printf("there is a branch instruction detected in the EAC stream\n");
        (ptr3)->counter[(ptr3)->free_index] = (ptr2 + 2)->dest_operand;
        (ptr3)->free_index += 1;
    }

    printf("the instruction fetched from memory for PIC stream\n");
    printf("opcode of ptr2+1 is %d\n", (ptr2 + 1)->opcode_field);
    printf("source operand1 of ptr2+1 %d\n", (ptr2 + 1)->source_operand1);
    printf("source operand2 of ptr2+1 %d\n", (ptr2 + 1)->source_operand2);
    printf("dest operand of ptr2+1 %d\n", (ptr2 + 1)->dest_operand);

    printf("the instruction fetched from memory for EAC stream\n");
    printf("opcode of ptr2+2 is %d\n", (ptr2 + 2)->opcode_field);
    printf("source operand1 of ptr2+2 %d\n", (ptr2 + 2)->source_operand1);
    printf("source operand2 of ptr2+2 %d\n", (ptr2 + 2)->source_operand2);
    printf("dest operand of ptr2+2 %d\n", (ptr2 + 2)->dest_operand);

    printf("the program counters are listed below\n");
    for (i = 0; i <= 9; i++)
    {
        printf("the value of counter %d of PIC stream is %d\n",
               i, (ptr3)->counter[i]);
    }
    for (i = 0; i <= 9; i++)
    {
        printf("the value of counter %d of EAC stream is %d\n",
               i, (ptr4)->counter[i]);
    }
    return (*(ptr2 + 1));
}
*/
*/

/********************************************/
/*                                          */
/*   Function to load the instruction       */
/*   status unit for the PIC stream         */
/*                                          */
/********************************************/

void load_isunit1(ptr1, ptr2, ptr3, ptr4, ptr5, ptr6)
structi0 *ptr1, *ptr2;
structi1 *ptr3;
structi7 *ptr4, *ptr5;
int ptr6;
{
    int i, bottom_stack1;

    bottom_stack1 = ptr6;
    /* mark every register unused (3), then mark the source
       registers (1) and the destination register (0) */
    for (i = 1; i <= 5; i++)
    {
        (ptr3 + (ptr3)->decode_ptr)->reg_util[i] = 3;
    }
    switch ((ptr1 + bottom_stack1)->source_operand1)
    {
    case 1: (ptr3 + (ptr3)->decode_ptr)->reg_util[1] = 1; break;
    case 2: (ptr3 + (ptr3)->decode_ptr)->reg_util[2] = 1; break;
    case 3: (ptr3 + (ptr3)->decode_ptr)->reg_util[3] = 1; break;
    case 4: (ptr3 + (ptr3)->decode_ptr)->reg_util[4] = 1; break;
    case 5: (ptr3 + (ptr3)->decode_ptr)->reg_util[5] = 1; break;
    }
    switch ((ptr1 + bottom_stack1)->source_operand2)
    {
    case 1: (ptr3 + (ptr3)->decode_ptr)->reg_util[1] = 1; break;
    case 2: (ptr3 + (ptr3)->decode_ptr)->reg_util[2] = 1; break;
    case 3: (ptr3 + (ptr3)->decode_ptr)->reg_util[3] = 1; break;
    case 4: (ptr3 + (ptr3)->decode_ptr)->reg_util[4] = 1; break;
    case 5: (ptr3 + (ptr3)->decode_ptr)->reg_util[5] = 1; break;
    }
    switch ((ptr1 + bottom_stack1)->dest_operand)
    {
    case 1: (ptr3 + (ptr3)->decode_ptr)->reg_util[1] = 0; break;
    case 2: (ptr3 + (ptr3)->decode_ptr)->reg_util[2] = 0; break;
    case 3: (ptr3 + (ptr3)->decode_ptr)->reg_util[3] = 0; break;
    case 4: (ptr3 + (ptr3)->decode_ptr)->reg_util[4] = 0; break;
    case 5: (ptr3 + (ptr3)->decode_ptr)->reg_util[5] = 0; break;
    }
}

/********************************************/
/*                                          */
/*   Function to load the instruction       */
/*   status unit for the EAC stream         */
/*                                          */
/********************************************/

void load_isunit2(ptr1, ptr2, ptr3, ptr4, ptr5, ptr6)
structi0 *ptr1, *ptr2;
structi1 *ptr3;
structi7 *ptr4, *ptr5;
int ptr6;
{
    int i, bottom_stack2;

    bottom_stack2 = ptr6;
    for (i = 1; i <= 5; i++)
    {
        (ptr3 + (ptr3)->decode_ptr)->reg_util[i] = 3;
    }
    switch ((ptr2 + bottom_stack2)->source_operand1)
    {
    case 1: (ptr3 + (ptr3)->decode_ptr)->reg_util[1] = 1; break;
    case 2: (ptr3 + (ptr3)->decode_ptr)->reg_util[2] = 1; break;
    case 3: (ptr3 + (ptr3)->decode_ptr)->reg_util[3] = 1; break;
    case 4: (ptr3 + (ptr3)->decode_ptr)->reg_util[4] = 1; break;
    case 5: (ptr3 + (ptr3)->decode_ptr)->reg_util[5] = 1; break;
    }
    switch ((ptr2 + bottom_stack2)->source_operand2)
    {
    case 1: (ptr3 + (ptr3)->decode_ptr)->reg_util[1] = 1; break;
    case 2: (ptr3 + (ptr3)->decode_ptr)->reg_util[2] = 1; break;
    case 3: (ptr3 + (ptr3)->decode_ptr)->reg_util[3] = 1; break;
    case 4: (ptr3 + (ptr3)->decode_ptr)->reg_util[4] = 1; break;
    case 5: (ptr3 + (ptr3)->decode_ptr)->reg_util[5] = 1; break;
    }
    switch ((ptr2 + bottom_stack2)->dest_operand)
    {
    case 1: (ptr3 + (ptr3)->decode_ptr)->reg_util[1] = 0; break;
    case 2: (ptr3 + (ptr3)->decode_ptr)->reg_util[2] = 0; break;
    case 3: (ptr3 + (ptr3)->decode_ptr)->reg_util[3] = 0; break;
    case 4: (ptr3 + (ptr3)->decode_ptr)->reg_util[4] = 0; break;
    case 5: (ptr3 + (ptr3)->decode_ptr)->reg_util[5] = 0; break;
    }
}

/********************************************/
/*                                          */
/*               Decode unit                */
/*                                          */
/********************************************/

structi0 decode_unit(ptr1, ptr2, ptr3, ptr4, ptr5, ptr6, ptr7, ptr8, ptr9)
structi0 *ptr1, *ptr2, *ptr3;  /* input from latch, dstack1, dstack2 */
structi1 *ptr4;                /* system status pointer */
structi7 *ptr5, *ptr6;         /* pointers to dstack1 and dstack2 */
structi0 *ptr7;                /* general purpose elements */
int ptr8, ptr9;
{
    int i;
    int top_stack1, bottom_stack1, top_stack2, bottom_stack2;

    top_stack1 = (ptr5)->top_stack;
    bottom_stack1 = (ptr5)->bottom_stack;
    top_stack2 = (ptr6)->top_stack;
    bottom_stack2 = (ptr6)->bottom_stack;

    /* loading of the PIC queue */
    if (((ptr5)->queue_select == 1) && ((ptr5)->full_queue != 1))
    {
        /* check whether the instruction is a valid instruction
           for the PIC stream */
        if ((ptr1 + 1)->valid != 0)
        {
            /* the instruction is valid */
            (ptr2 + top_stack1)->opcode_field = (ptr1 + 1)->opcode_field;
            (ptr2 + top_stack1)->source_operand1 = (ptr1 + 1)->source_operand1;
            (ptr2 + top_stack1)->source_operand2 = (ptr1 + 1)->source_operand2;
            (ptr2 + top_stack1)->dest_operand = (ptr1 + 1)->dest_operand;
            (ptr5)->top_stack += 1;
        }
    }

    /* loading of the EAC queue */
    if (((ptr6)->queue_select == 1) && ((ptr6)->full_queue != 1))
    {
        /* check whether the instruction is a valid instruction
           for the EAC queue */
        if ((ptr1 + 2)->valid != 0)
        {
            /* the instruction is valid */
            (ptr3 + top_stack2)->opcode_field = (ptr1 + 2)->opcode_field;
            (ptr3 + top_stack2)->source_operand1 = (ptr1 + 2)->source_operand1;
            (ptr3 + top_stack2)->source_operand2 = (ptr1 + 2)->source_operand2;
            (ptr3 + top_stack2)->dest_operand = (ptr1 + 2)->dest_operand;
            (ptr6)->top_stack += 1;
        }
    }

    /* forwarding the instruction to the decoder; the fifteen-way
       switches of the original listing set the opcode and an
       execution time of 3 cycles for every opcode except 3
       (8 cycles), 4 (23 cycles) and 5, 6 (6 cycles each), so the
       execution times are tabulated once here */
    {
        static int exec_table[16] =
            { 0, 3, 3, 8, 23, 6, 6, 3, 3, 3, 3, 3, 3, 3, 3, 3 };
        int opc;

        if (current_queue == 1)
        {
            opc = (ptr2 + bottom_stack1)->opcode_field;
            if ((opc >= 1) && (opc <= 15))
            {
                (ptr4 + (ptr4)->decode_ptr)->opcode = opc;
                (ptr4 + (ptr4)->decode_ptr)->exec_time = exec_table[opc];
                load_isunit1(ptr2, ptr3, ptr4, ptr5, ptr6, bottom_stack1);
            }

            /* forwarding the instruction to the issue unit */
            if (disable_decode != 1)
            {
                (ptr7 + 3)->opcode_field =
                    (ptr2 + bottom_stack1)->opcode_field;
                (ptr7 + 3)->source_operand1 =
                    (ptr2 + bottom_stack1)->source_operand1;
                (ptr7 + 3)->source_operand2 =
                    (ptr2 + bottom_stack1)->source_operand2;
                (ptr7 + 3)->dest_operand =
                    (ptr2 + bottom_stack1)->dest_operand;

                /* rearranging the stack; the decode stacks hold
                   20 entries (indices 0..19) */
                for (i = 2; i <= 19; i++)
                {
                    (ptr2 + (i - 1))->opcode_field =
                        (ptr2 + i)->opcode_field;
                    (ptr2 + (i - 1))->source_operand1 =
                        (ptr2 + i)->source_operand1;
                    (ptr2 + (i - 1))->source_operand2 =
                        (ptr2 + i)->source_operand2;
                    (ptr2 + (i - 1))->dest_operand =
                        (ptr2 + i)->dest_operand;
                }
                (ptr5)->top_stack -= 1;
            }
        }

        if (current_queue == 2)
        {
            opc = (ptr3 + bottom_stack2)->opcode_field;
            if ((opc >= 1) && (opc <= 15))
            {
                (ptr4 + (ptr4)->decode_ptr)->opcode = opc;
                (ptr4 + (ptr4)->decode_ptr)->exec_time = exec_table[opc];
                load_isunit2(ptr2, ptr3, ptr4, ptr5, ptr6, bottom_stack2);
            }

            /* forwarding the instruction to the issue unit */
            if (disable_decode != 1)
            {
                (ptr7 + 4)->opcode_field =
                    (ptr3 + bottom_stack2)->opcode_field;
                (ptr7 + 4)->source_operand1 =
                    (ptr3 + bottom_stack2)->source_operand1;
                (ptr7 + 4)->source_operand2 =
                    (ptr3 + bottom_stack2)->source_operand2;
                (ptr7 + 4)->dest_operand =
                    (ptr3 + bottom_stack2)->dest_operand;

                /* rearranging the stack */
                for (i = 2; i <= 19; i++)
                {
                    (ptr3 + (i - 1))->opcode_field =
                        (ptr3 + i)->opcode_field;
                    (ptr3 + (i - 1))->source_operand1 =
                        (ptr3 + i)->source_operand1;
                    (ptr3 + (i - 1))->source_operand2 =
                        (ptr3 + i)->source_operand2;
                    (ptr3 + (i - 1))->dest_operand =
                        (ptr3 + i)->dest_operand;
                }
                (ptr6)->top_stack -= 1;
            }
        }
    }

    (ptr4)->decode_ptr += 1;
    if ((ptr4)->decode_ptr == 20)
    {
        (ptr4)->decode_ptr = 0;
    }
    return (*ptr7);
}

/******************************************/
/************* Issue unit *****************/
/******************************************/

struct5 issue_unit(ptr1,ptr2,ptr3,ptr4,ptr5,ptr6,ptr7)
struct0 *ptr1; /* pointers to the latches */
struct1 *ptr2;
struct6 *ptr3,*ptr4; /* decode stack pointers */
struct5 *ptr7;
{
    int i,j,k,l;
    int issue_pointer, dest_ptr, src1_ptr, src2_ptr;
    int temp1,temp2,temp3,temp4,raw_delay,waw_delay,inst_delay;
    issue_pointer = (ptr2)->issue_ptr;

    /* issue logic for PIC stream */
    if((current_queue == 1) && (disable_issue != 1))
    {
        temp1 = (ptr2+issue_pointer)->count_units[(ptr1+3)->dest_operand];
        temp2 = (ptr2+issue_pointer)->count_units[(ptr1+3)->source_operand1];
        temp3 = (ptr2+issue_pointer)->count_units[(ptr1+3)->source_operand2];
        temp4 = (ptr2+issue_pointer)->exec_time;

        /* computing RAW hazards */
        if ((temp2 == 0) && (temp3 == 0))
        {
            raw_delay = 0;
        }
        if ((temp2 > 0) && (temp3 == 0))
        {
            raw_delay = temp2;
        }
        if ((temp2 == 0) && (temp3 > 0))
        {
            raw_delay = temp3;
        }
        if ((temp2 > 0) && (temp3 > 0))
        {
            if (temp2 > temp3)
            {
                raw_delay = temp2;
            }
            else
            {
                raw_delay = temp3;
            }
        }

        /* checking for WAW hazards */
        if (temp1 == 0)
        {
            waw_delay = raw_delay + 1;
        }
        else if ((temp1 != 0) && (temp1 <= (raw_delay+temp4)))
        {
            waw_delay = temp1 + 2;
        }
        else
        {
            waw_delay = temp1 - temp4 + 3;
        }

        /* computing total delay */
        if (raw_delay > waw_delay)
        {
            inst_delay = raw_delay + 1;
        }
        else
        {
            inst_delay = waw_delay + 1;
        }

        /* updating the counter associated with the sink register */
        (ptr2+issue_pointer)->count_units[(ptr1+3)->dest_operand] =
            inst_delay + temp4 - 1;

        issue_pointer += 1;
        (ptr2)->issue_ptr = issue_pointer;
        if((ptr2)->issue_ptr == 20)
        {
            (ptr2)->issue_ptr = 0;
        }
    }

    /* issue logic for EAC stream */
    if((current_queue == 2) && (disable_issue != 1))
    {
        temp1 = (ptr2+issue_pointer)->count_units[(ptr1+4)->dest_operand];
        temp2 = (ptr2+issue_pointer)->count_units[(ptr1+4)->source_operand1];
        temp3 = (ptr2+issue_pointer)->count_units[(ptr1+4)->source_operand2];
        temp4 = (ptr2+issue_pointer)->exec_time;

        /* computing RAW hazards */
        if ((temp2 == 0) && (temp3 == 0))
        {
            raw_delay = 0;
        }
        if ((temp2 > 0) && (temp3 == 0))
        {
            raw_delay = temp2;
        }
        if ((temp2 == 0) && (temp3 > 0))
        {
            raw_delay = temp3;
        }
        if ((temp2 > 0) && (temp3 > 0))
        {
            if(temp2 > temp3)
            {
                raw_delay = temp2;
            }
            else
            {
                raw_delay = temp3;
            }
        }

        /* checking for WAW hazards */
        if (temp1 == 0)
        {
            waw_delay = raw_delay + 1;
        }
        else if ((temp1 != 0) && (temp1 <= (raw_delay+temp4)))
        {
            waw_delay = temp1 + 2;
        }
        else
        {
            waw_delay = temp1 - temp4 + 3;
        }

        /* computing total delay */
        if (raw_delay > waw_delay)
        {
            inst_delay = raw_delay + 1;
        }
        else
        {
            inst_delay = waw_delay + 1;
        }

        /* updating the counter associated with the sink register */
        (ptr2+issue_pointer)->count_units[(ptr1+4)->dest_operand] =
            inst_delay + temp4 - 1;

        issue_pointer += 1;
        (ptr2)->issue_ptr = issue_pointer;
        if ((ptr2)->issue_ptr == 20)
        {
            (ptr2)->issue_ptr = 0;
        }
    }
    return (*ptr7);
}

/******************************************/
/******* Function Initializations *********/
/******************************************/

void initialize(num1,num2)
struct7 *num1;
struct8 *num2;
{
    int i,j,k,l;
    for(i=1;i<=40;i++)
    {
        (num1+i)->location = i;
        (num1+i)->func = 0;
        (num2+i)->destination = i;
    }
}

/******************************************/
/***** Function Re-Initializations ********/
/******************************************/

struct7 reinit(num1)
struct7 *num1;
{
    struct7 *temp;
    int i,j,k,l;
    temp = num1;
    l = 1;
    for(i=1;i<=40;i++)
    {
        if ((num1+i)->func == 5)
        {
            (num1+i)->location = l;
            l = l + 1;
        }
        else if((num1+i)->func == 4)
        {
            (num1+i)->location = l;
        }
    }
    num1 = temp;
    return (*num1);
}

/******************************************/
/*********** Function stage 1 *************/
/******************************************/

struct2 stage_one(number_one,number_two,num_three,num_pass1,num_pass2)
struct1 *number_one,*number_two;
struct2 *num_three;
int num_pass1,num_pass2;
{
    int i,j,k,l;
    struct1 *p1,*p2;
    struct2 *p3;
    p1 = number_one;
    p2 = number_two;
    p3 = num_three;
    for (i=0;i<8;++i)
    {
        for (j=0;j<8;++j)
        {
            (num_three += i)->word[i+j] =
                ((number_one += num_pass1)->bits[j]) *
                ((number_two += num_pass2)->bits[i]);
            number_one = p1;
            number_two = p2;
            num_three = p3;
        }
    }
    printf(" \n");
    printf(" the partial products calculated in function stage1 are as follows \n");
    printf(" \n");
    printf(" \n");
    printf(" \n");
    for(i=0;i<8;++i)
    {
        printf("The partial product %d is\n",i);
        printf("\n");
        for(j=0;j<16;++j)
        {
            printf(" %d ",(num_three += i)->word[j]);
            num_three = p3;
        }
        printf("\n");
        printf("\n");
    }
    return (*num_three);
}

/******************************************/
/*********** Function Stage 2 *************/
/******************************************/

struct2 stage_two(num1,num2,num3,num4,num5,num6,num7,num8,num9,num10)
struct2 *num1,*num2,*num3,*num4,*num5,*num6,*num7,*num8,*num9,*num10;
{
    int i,j,k,l,nega,negb,negc;
    struct2 *p1,*p2,*p3,*p4,*p5,*p6,*p7,*p8,*p9,*p10;
    p1 = num1;
    p2 = num2;
    p3 = num3;
    p4 = num4;
    p5 = num5;
    p6 = num6;
    p7 = num7;
    p8 = num8;
    p9 = num9;
    p10 = num10;
    /* realization of csa unit number one */
    for(i=0;i<16;++i)
    {
        nega = 0;
        negb = 0;
        negc = 0;
        if ((num1)->word[i] == 0)
        {
            nega = 1;
        }
        if ((num2)->word[i] == 0)
        {
            negb = 1;
        }
        if ((num3)->word[i] == 0)
        {
            negc = 1;
        }
        num7->word[i] = ((((num1->word[i]*negb*negc)
            | (num2->word[i]*nega*negc)) | (num3->word[i]*nega*negb)) |
            (num1->word[i]*num2->word[i]*num3->word[i]));
        num8->word[i+1] = ((num1->word[i]*num2->word[i])
            | (num3->word[i]*num1->word[i])
            | (num2->word[i]*num3->word[i]));
    }
    /* realization of csa unit number two */
    for(i=0;i<16;++i)
    {
        nega = 0;
        negb = 0;
        negc = 0;
        if ((num4)->word[i] == 0)
        {
            nega = 1;
        }
        if ((num5)->word[i] == 0)
        {
            negb = 1;
        }
        if ((num6)->word[i] == 0)
        {
            negc = 1;
        }
        num9->word[i] = ((((num4->word[i]*negb*negc)
            | (num5->word[i]*nega*negc)) | (num6->word[i]*nega*negb)) |
            (num4->word[i]*num5->word[i]*num6->word[i]));
        num10->word[i+1] = ((num4->word[i]*num5->word[i])
            | (num6->word[i]*num4->word[i])
            | (num5->word[i]*num6->word[i]));
    }
    return(*num7,*num8,*num9,*num10);
}

/******************************************/
/*********** Function Stage 3 *************/
/******************************************/

struct2 stage_three(num1,num2,num3,num4,num5,num6,num7,num8,num9,num10)
struct2 *num1,*num2,*num3,*num4,*num5,*num6,*num7,*num8,*num9,*num10;
{
    int i,j,k,l,nega,negb,negc;
    struct2 *p1,*p2,*p3,*p4,*p5,*p6,*p7,*p8,*p9,*p10;
    p1 = num1;
    p2 = num2;
    p3 = num3;
    p4 = num4;
    p5 = num5;
    p6 = num6;
    p7 = num7;
    p8 = num8;
    p9 = num9;
    p10 = num10;
    /* realization of csa unit number three */
    for(i=0;i<16;++i)
    {
        nega = 0;
        negb = 0;
        negc = 0;
        if ((num1)->word[i] == 0)
        {
            nega = 1;
        }
        if ((num2)->word[i] == 0)
        {
            negb = 1;
        }
        if ((num3)->word[i] == 0)
        {
            negc = 1;
        }
        num7->word[i] = ((((num1->word[i]*negb*negc)
            | (num2->word[i]*nega*negc)) | (num3->word[i]*nega*negb)) |
            (num1->word[i]*num2->word[i]*num3->word[i]));
        num8->word[i+1] = ((num1->word[i]*num2->word[i])
            | (num3->word[i]*num1->word[i])
            | (num2->word[i]*num3->word[i]));
    }
    /* realization of csa unit number four */
    for(i=0;i<16;++i)
    {
        nega = 0;
        negb = 0;
        negc = 0;
        if ((num4)->word[i] == 0)
        {
            nega = 1;
        }
        if ((num5)->word[i] == 0)
        {
            negb = 1;
        }
        if ((num6)->word[i] == 0)
        {
            negc = 1;
        }
        num9->word[i] = ((((num4->word[i]*negb*negc)
            | (num5->word[i]*nega*negc)) | (num6->word[i]*nega*negb)) |
            (num4->word[i]*num5->word[i]*num6->word[i]));
        num10->word[i+1] = ((num4->word[i]*num5->word[i])
            | (num6->word[i]*num4->word[i])
            | (num5->word[i]*num6->word[i]));
    }
    return(*num7,*num8,*num9,*num10);
}

/******************************************/
/*********** Function Stage 4 *************/
/******************************************/

struct2 stage_four(num1,num2,num3,num4,num5)
struct2 *num1,*num2,*num3,*num4,*num5;
{
    int i,j,k,l,nega,negb,negc;
    struct2 *p1,*p2,*p3,*p4,*p5;
    p1 = num1;
    p2 = num2;
    p3 = num3;
    p4 = num4;
    p5 = num5;
    /* realization of csa unit number five */
    for(i=0;i<16;++i)
    {
        nega = 0;
        negb = 0;
        negc = 0;
        if ((num1)->word[i] == 0)
        {
            nega = 1;
        }
        if ((num2)->word[i] == 0)
        {
            negb = 1;
        }
        if ((num3)->word[i] == 0)
        {
            negc = 1;
        }
        num4->word[i] = ((((num1->word[i]*negb*negc)
            | (num2->word[i]*nega*negc)) | (num3->word[i]*nega*negb)) |
            (num1->word[i]*num2->word[i]*num3->word[i]));
        num5->word[i+1] = ((num1->word[i]*num2->word[i])
            | (num3->word[i]*num1->word[i])
            | (num2->word[i]*num3->word[i]));
    }
    return(*num4,*num5);
}

/******************************************/
/*********** Function Stage 5 *************/
/******************************************/

struct2 stage_five(num1,num2,num3,num4,num5)
struct2 *num1,*num2,*num3,*num4,*num5;
{
    int i,j,k,l,nega,negb,negc;
    struct2 *p1,*p2,*p3,*p4,*p5;
    p1 = num1;
    p2 = num2;
    p3 = num3;
    p4 = num4;
    p5 = num5;
    /* realization of csa unit number six */
    for(i=0;i<16;++i)
    {
        nega = 0;
        negb = 0;
        negc = 0;
        if ((num1)->word[i] == 0)
        {
            nega = 1;
        }
        if ((num2)->word[i] == 0)
        {
            negb = 1;
        }
        if ((num3)->word[i] == 0)
        {
            negc = 1;
        }
        num4->word[i] = ((((num1->word[i]*negb*negc)
            | (num2->word[i]*nega*negc)) | (num3->word[i]*nega*negb)) |
            (num1->word[i]*num2->word[i]*num3->word[i]));
        num5->word[i+1] = ((num1->word[i]*num2->word[i])
            | (num3->word[i]*num1->word[i])
            | (num2->word[i]*num3->word[i]));
    }
    return(*num4,*num5);
}

/******************************************/
/*********** Function Stage 6 *************/
/******************************************/

struct2 stage_six(num1,num2,num3,num4)
struct2 *num1,*num2,*num3;
struct1 *num4;
{
    int i,j,k,l,nega,negb,negc,carry[17];
    struct2 *p1,*p2,*p3;
    struct1 *p4;
    p1 = num1;
    p2 = num2;
    p3 = num3;
    p4 = num4;
    carry[0] = 0;
    /* here the distinction is being made between add & sub and the rest */
    if((addition == 1) || (subtraction == 1))
    {
        if(addition == 1)
        {
            printf(" addition is one \n");
            for(i=0;i<=7;i++)
            {
                (num1)->word[7-i] = (p4+1)->bits[i];
                (num2)->word[7-i] = (p4+2)->bits[i];
            }
            addition = 0;
            for(j=8;j<=16;j++)
            {
                (num1)->word[j] = 0;
                (num2)->word[j] = 0;
            }
        }
        if(subtraction == 1)
        {
            printf(" subtraction is one \n");
            for(i=0;i<=7;i++)
            {
                (num1)->word[7-i] = (p4+1)->bits[i];
                /* inverting the operand */
                if ((p4+2)->bits[i] == 1)
                {
                    (p4+2)->bits[i] = 0;
                }
                else
                {
                    (p4+2)->bits[i] = 1;
                }
                (num2)->word[7-i] = (p4+2)->bits[i];
            }
            carry[0] = 1;
            subtraction = 0;
            for(j=8;j<=16;j++)
            {
                (num1)->word[j] = 0;
                (num2)->word[j] = 1;
            }
        }
    }
    printf("\n");
    printf("\n");
    printf(" the following is the entered numbers\n");
    printf("\n");
    printf("\n");
    printf(" the value of num1 loaded is (16 - 0)\n");
    for(j=0;j<=16;j++)
    {
        printf(" %d ",num1->word[16-j]);
    }
    printf("\n");
    printf("\n");
    printf(" the value of num2 loaded is (16 - 0)\n");
    for(j=0;j<=16;j++)
    {
        printf(" %d ",num2->word[16-j]);
    }
    printf("\n");
    printf("\n");

    /* realization of carry propagation adder */
    for(i=0;i<16;++i)
    {
        nega = 0;
        negb = 0;
        negc = 0;
        if ((num1)->word[i] == 0)
        {
            nega = 1;
        }
        if ((num2)->word[i] == 0)
        {
            negb = 1;
        }
        if (carry[i] == 0)
        {
            negc = 1;
        }
        num3->word[i] = ((((num1->word[i]*negb*negc)
            | (num2->word[i]*nega*negc)) | (carry[i]*nega*negb)) |
            (num1->word[i]*num2->word[i]*carry[i]));
        carry[i+1] = ((num1->word[i]*num2->word[i])
            | (carry[i]*num1->word[i]) | (num2->word[i]*carry[i]));
    }
    return (*num3);
}

/******************************************/
/********* Function Delay One *************/
/******************************************/

struct2 delay_one(num1,num2,num3,num4)
struct2 *num1,*num2,*num3,*num4;
{
    int i,j,k;
    struct2 *p1,*p2,*p3,*p4;
    p1 = num1;
    p2 = num2;
    p3 = num3;
    p4 = num4;
    for (i=0;i<17;i++)
    {
        num3->word[i] = num1->word[i];
        num4->word[i] = num2->word[i];
    }
    return (*num3,*num4);
}

/******************************************/
/********* Function Delay Two *************/
/******************************************/

struct2 delay_two(num1,num2)
struct2 *num1,*num2;
{
    int i,j,k;
    struct2 *p1,*p2;
    p1 = num1;
    p2 = num2;
    for (i=0;i<17;i++)
    {
        num2->word[i] = num1->word[i];
    }
    return (*num2);
}

/* starting of the pipeline */
/* entering the values of the arguments */

void pipeline()
{
    struct2 *pone,*ptwo,*pthree,*pfour,*pfive,*psix,*pseven,*peight;
    struct2 *pnine,*pten;
    int one,two,i,j,k,l,m,v;
    printf(" \n");
    printf(" \n");
    /*printf(" enter the argument one from bit 8 to 0 \n");
    scanf("%d %d %d %d %d %d %d %d %d",
        &argument1[0].bits[8],&argument1[0].bits[7],
        &argument1[0].bits[6],&argument1[0].bits[5],
        &argument1[0].bits[4],&argument1[0].bits[3],
        &argument1[0].bits[2],&argument1[0].bits[1],
        &argument1[0].bits[0]);
    printf(" \n");
    printf(" \n");
    printf(" enter the argument two from bit 7 to 0 \n");
    scanf("%d %d %d %d %d %d %d %d",
        &argument2[0].bits[7],&argument2[0].bits[6],
        &argument2[0].bits[5],&argument2[0].bits[4],
        &argument2[0].bits[3],&argument2[0].bits[2],
        &argument2[0].bits[1],&argument2[0].bits[0]);
    printf(" \n");*/
    printf(" the data is fed from mpreg + 3,4 \n");
    if(multiplication == 1)
    {
        multiplication = 0;
        for(j=0;j<=7;j++)
        {
            (arg1_pointer+0)->bits[7-j] = (mpreg_ptr + 3)->bits[j];
            (arg2_pointer+0)->bits[7-j] = (mpreg_ptr + 4)->bits[j];
        }
    }
    if ((division == 1) || (delta_flag == 1))
    {
        division = 0;
        delta_flag = 0;
        for(j=0;j<=7;j++)
        {
            (arg2_pointer+0)->bits[7-j] = (mpreg_ptr + 3)->bits[j];
            (arg1_pointer+0)->bits[8-j] = (mpreg_ptr + 4)->bits[j];
        }
    }
    if (init_key == 1)
    {
        init_key = 0;
        for(j=0;j<=8;j++)
        {
            (arg1_pointer+0)->bits[8-j] = 0;
            (arg2_pointer+0)->bits[8-j] = 0;
        }
        printf(" the arguments are initialised to zero in init_key\n");
    }
    printf(" \n");
    printf(" the argument one is printed below 8 - 0 bits in correct place\n");
    printf(" \n");
    printf(" \n");
    for(j=0;j<=8;j++)
    {
        printf(" %d ",(arg1_pointer+0)->bits[8-j]);
    }
    printf(" \n");
    printf(" \n");
    printf(" the argument two is printed below 8 - 0 bits in correct place\n");
    printf(" \n");
    printf(" \n");
    for(j=0;j<=8;j++)
    {
        printf(" %d ",(arg2_pointer+0)->bits[8-j]);
    }
    printf(" \n");
    printf(" \n");
    one = 0;
    two = 0;
    stage_one(arg1_pointer,arg2_pointer,par_pointer,one,two);
    printf(" \n");
    printf(" \n");
    printf(" \n");
    printf(" the arguments are \n");
    printf(" \n");
    printf(" \n");
    printf(" argument1 is 8 - 0\n");
    printf(" \n");
    for (i=0;i<=8;++i)
    {
        printf(" %d ",argument1[0].bits[8-i]);
    }
    printf(" \n");
    printf(" \n");
    printf(" argument2 is 8 - 0\n");
    printf(" \n");
    for (i=0;i<=8;++i)
    {
        printf(" %d ",argument2[0].bits[8-i]);
    }
    printf(" \n");
    printf(" \n");
    printf(" the partial products calculated after function stage1 are as follows \n");
    printf(" \n");
    printf(" \n");
    printf(" \n");
    for(i=0;i<8;++i)
    {
        printf("The partial product %d is\n",i);
        printf("\n");
        for(j=0;j<16;++j)
        {
            printf(" %d ",(par_pointer += i)->word[j]);
            par_pointer = copy_three;
        }
        printf("\n");
        printf("\n");
    }

    /* stage two starts now */
    par_pointer = copy_three;
    lat_pointer = copy_four;
    trans_pointer = copy_five;
    delay_ptr = copy_six;
    pone = trans_pointer + 0;
    ptwo = trans_pointer + 1;
    pthree = trans_pointer + 2;
    pfour = trans_pointer + 3;
    pfive = trans_pointer + 4;
    psix = trans_pointer + 5;
    pseven = lat_pointer + 0;
    peight = lat_pointer + 1;
    pnine = lat_pointer + 2;
    pten = lat_pointer + 3;
    stage_two(pone,ptwo,pthree,pfour,pfive,psix,pseven,peight,pnine,pten);
    pone = delay_ptr + 3;
    ptwo = delay_ptr + 4;
    pthree = delay_ptr + 0;
    pfour = delay_ptr + 1;
    delay_one(pone,ptwo,pthree,pfour);
    printf(" \n");
    printf(" \n");
    printf(" \n");
    printf(" THE SUM AND CARRY VECTORS OF STAGE TWO \n");
    printf(" \n");
    printf(" FIRST AND THIRD ARE SUM VECTORS S1 AND S2 \n");
    printf(" \n");
    printf(" SECOND AND FOURTH ARE CARRY VECTORS C1 AND C2 \n");
    printf(" \n");
    printf(" \n");
    printf(" \n");
    for(i=0;i<4;++i)
    {
        printf(" \n");
        for(j=0;j<16;++j)
        {
            printf(" %d ",(lat_pointer += i)->word[j]);
            lat_pointer = copy_four;
        }
        printf("\n");
        printf("\n");
    }

    /* stage three starts now */
    par_pointer = copy_three;
    lat_pointer = copy_four;
    trans_pointer = copy_five;
    delay_ptr = copy_six;
    pone = trans_pointer + 6;
    ptwo = trans_pointer + 7;
    pthree = trans_pointer + 8;
    pfour = trans_pointer + 9;
    pfive = trans_pointer + 10;
    psix = trans_pointer + 11;
    pseven = lat_pointer + 4;
    peight = lat_pointer + 5;
    pnine = lat_pointer + 6;
    pten = lat_pointer + 7;
    stage_three(pone,ptwo,pthree,pfour,pfive,psix,pseven,peight,pnine,pten);
    printf(" \n");
    printf(" \n");
    printf(" \n");
    printf(" THE SUM AND CARRY VECTORS OF STAGE THREE \n");
    printf(" \n");
    printf(" FIRST AND THIRD ARE SUM VECTORS S3 AND S4 \n");
    printf(" \n");
    printf(" SECOND AND FOURTH ARE CARRY VECTORS C3 AND C4 \n");
    printf(" \n");
    printf(" \n");
    printf(" \n");
    for (i=4;i<8;++i)
    {
        printf(" \n");
        for(j=0;j<16;++j)
        {
            printf(" %d ",(lat_pointer += i)->word[j]);
            lat_pointer = copy_four;
        }
        printf("\n");
        printf("\n");
    }

    /* stage four starts now */
    par_pointer = copy_three;
    lat_pointer = copy_four;
    trans_pointer = copy_five;
    delay_ptr = copy_six;
    pone = trans_pointer + 12;
    ptwo = trans_pointer + 13;
    pthree = trans_pointer + 14;
    pfour = lat_pointer + 8;
    pfive = lat_pointer + 9;
    stage_four(pone,ptwo,pthree,pfour,pfive);
    pone = delay_ptr + 5;
    ptwo = delay_ptr + 2;
    delay_two(pone,ptwo);
    printf(" \n");
    printf(" \n");
    printf(" THE SUM AND CARRY VECTORS OF STAGE FOUR \n");
    printf(" \n");
    printf(" FIRST IS THE SUM VECTOR S5 \n");
    printf(" \n");
    printf(" SECOND IS THE CARRY VECTOR C5 \n");
    printf(" \n");
    printf(" \n");
    printf(" \n");
    for(i=8;i<10;++i)
    {
        printf(" \n");
        printf(" the value of pointer is %d\n",lat_pointer += i);
        lat_pointer = copy_four;
        printf("\n");
        for(j=0;j<16;++j)
        {
            printf(" %d ",(lat_pointer += i)->word[j]);
            lat_pointer = copy_four;
        }
        printf("\n");
        printf("\n");
    }

    /* stage five starts now */
    par_pointer = copy_three;
    lat_pointer = copy_four;
    trans_pointer = copy_five;
    delay_ptr = copy_six;
    pone = trans_pointer + 15;
    ptwo = trans_pointer + 16;
    pthree = trans_pointer + 17;
    pfour = lat_pointer + 10;
    pfive = lat_pointer + 11;
    stage_five(pone,ptwo,pthree,pfour,pfive);
    printf(" \n");
    printf(" \n");
    printf(" \n");
    printf(" THE SUM AND CARRY VECTORS OF STAGE FIVE \n");
    printf(" \n");
    printf(" FIRST IS THE SUM VECTOR S6 \n");
    printf(" \n");
    printf(" SECOND IS THE CARRY VECTOR C6 \n");
    printf(" \n");
    printf(" \n");
    printf(" \n");
    for(i=10;i<12;++i)
    {
        printf(" \n");
        for(j=0;j<16;++j)
        {
            printf(" %d ",(lat_pointer += i)->word[j]);
            lat_pointer = copy_four;
        }
        printf("\n");
        printf("\n");
    }

    /* stage six starts now */
    par_pointer = copy_three;
    lat_pointer = copy_four;
    trans_pointer = copy_five;
    delay_ptr = copy_six;
    pone = trans_pointer + 18;
    ptwo = trans_pointer + 19;
    pthree = lat_pointer + 12;
    stage_six(pone,ptwo,pthree,mpreg_ptr);
    printf(" \n");
    printf(" \n");
    printf(" \n");
    printf(" THE PRODUCT OF STAGE SIX \n");
    printf(" \n");
    printf(" \n");
    for(j=0;j<16;++j)
    {
        printf(" %d ",(lat_pointer += 12)->word[(15-j)]);
        lat_pointer = copy_four;
    }
    printf(" \n");
    printf("\n");

    /* stage seven */
    par_pointer = copy_three;
    lat_pointer = copy_four;
    trans_pointer = copy_five;
    delay_ptr = copy_six;
    pone = trans_pointer + 20;
    ptwo = delay_ptr + 7;
    delay_two(pone,ptwo);
    printf(" \n");
    printf(" \n");
    printf(" \n");
    printf(" THE RESULT OF STAGE SEVEN \n");
    printf(" \n");
    printf(" \n");
    for(j=0;j<16;++j)
    {
        printf(" %d ",(delay_ptr += 7)->word[(15-j)]);
        delay_ptr = copy_six;
    }
    return;
}

/* this stage represents the off period of the clock cycle */

struct2 time_off()
{
    int one,two,i,j,k,l,m,v;
    for(v=0;v<17;v++)
    {
        (trans_pointer += 0)->word[v] = (par_pointer += 0)->word[v];
        trans_pointer = copy_five;
        par_pointer = copy_three;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 1)->word[v] = (par_pointer += 1)->word[v];
        trans_pointer = copy_five;
        par_pointer = copy_three;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 2)->word[v] = (par_pointer += 2)->word[v];
        trans_pointer = copy_five;
        par_pointer = copy_three;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 3)->word[v] = (par_pointer += 3)->word[v];
        trans_pointer = copy_five;
        par_pointer = copy_three;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 4)->word[v] = (par_pointer += 4)->word[v];
        trans_pointer = copy_five;
        par_pointer = copy_three;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 5)->word[v] = (par_pointer += 5)->word[v];
        trans_pointer = copy_five;
        par_pointer = copy_three;
    }
    for(v=0;v<17;v++)
    {
        (delay_ptr += 3)->word[v] = (par_pointer += 6)->word[v];
        delay_ptr = copy_six;
        par_pointer = copy_three;
    }
    for(v=0;v<17;v++)
    {
        (delay_ptr += 4)->word[v] = (par_pointer += 7)->word[v];
        delay_ptr = copy_six;
        par_pointer = copy_three;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 10)->word[v] = (delay_ptr += 0)->word[v];
        trans_pointer = copy_five;
        delay_ptr = copy_six;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 11)->word[v] = (delay_ptr += 1)->word[v];
        trans_pointer = copy_five;
        delay_ptr = copy_six;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 6)->word[v] = (lat_pointer += 0)->word[v];
        trans_pointer = copy_five;
        lat_pointer = copy_four;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 7)->word[v] = (lat_pointer += 1)->word[v];
        trans_pointer = copy_five;
        lat_pointer = copy_four;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 8)->word[v] = (lat_pointer += 2)->word[v];
        trans_pointer = copy_five;
        lat_pointer = copy_four;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 9)->word[v] = (lat_pointer += 3)->word[v];
        trans_pointer = copy_five;
        lat_pointer = copy_four;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 12)->word[v] = (lat_pointer += 4)->word[v];
        trans_pointer = copy_five;
        lat_pointer = copy_four;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 13)->word[v] = (lat_pointer += 5)->word[v];
        trans_pointer = copy_five;
        lat_pointer = copy_four;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 14)->word[v] = (lat_pointer += 6)->word[v];
        trans_pointer = copy_five;
        lat_pointer = copy_four;
    }
    for(v=0;v<17;v++)
    {
        (delay_ptr += 5)->word[v] = (lat_pointer += 7)->word[v];
        delay_ptr = copy_six;
        lat_pointer = copy_four;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 15)->word[v] = (delay_ptr += 2)->word[v];
        trans_pointer = copy_five;
        delay_ptr = copy_six;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 16)->word[v] = (lat_pointer += 9)->word[v];
        trans_pointer = copy_five;
        lat_pointer = copy_four;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 17)->word[v] = (lat_pointer += 8)->word[v];
        trans_pointer = copy_five;
        lat_pointer = copy_four;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 18)->word[v] = (lat_pointer += 10)->word[v];
        trans_pointer = copy_five;
        lat_pointer = copy_four;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 19)->word[v] = (lat_pointer += 11)->word[v];
        trans_pointer = copy_five;
        lat_pointer = copy_four;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 20)->word[v] = (lat_pointer += 12)->word[v];
        trans_pointer = copy_five;
        lat_pointer = copy_four;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 21)->word[v] = (delay_ptr += 7)->word[v];
        trans_pointer = copy_five;
        delay_ptr = copy_six;
    }
    return;
}
/******************************************/
/********* Function cal delta *************/
/******************************************/

struct7 cal_delta(b1,b2,b3,b4)
struct7 *b1;
int *b2[20],b3,b4; /* b3 = ref_num2 , b4 = ref_num1 */
{
    struct7 *temp1;
    int *temp2[8],temp3[9],temp5;
    int i,j,k,l,carry;
    /*printf(" entered cal delta\n");*/
    /* inverting of the passed argument */
    for(i=1;i<=8;i++)
    {
        if (*(b2[b3]+i) == 1)
        {
            temp3[i] = 0;
        }
        else
        {
            temp3[i] = 1;
        }
    }
    /*printf("the inverted value in cal_delta \n");
    for (i=1;i<=8;i++)
    {
        printf(" %d ",temp3[i]);
    }
    printf("\n");*/
    /* adding the one to form delta */
    carry = 1;
    i = 1;
    while(carry == 1)
    {
        if (temp3[9-i] == 1)
        {
            temp3[9-i] = 0;
            carry = 1;
        }
        else
        {
            temp3[9-i] = 1;
            carry = 0;
        }
        i++;
    }
    /*printf("the converted value in cal_delta after adding one\n");
    for (i=1;i<=8;i++)
    {
        printf(" %d ",temp3[i]);
    }
    printf("\n");*/
    /* loading of data into delta */
    temp5 = b4 + 1;
    for(i=1;i<=8;i++)
    {
        (b1 + temp5)->num_one[i] = temp3[i];
        (b1 + temp5)->num_two[i] = temp3[i];
        (b1 + b4)->num_two[i] = temp3[i];
    }
    (b1 + temp5)->func = 5;
    (b1 + b4)->num_two[0] = 1;
    return (*b1);
}

/******************************************/
/******* Function subtract load ***********/
/******************************************/

struct7 subtract_load(c1,c2,c3,c4,c5,c6)
struct7 *c1;
int *c2,*c3[20],*c4[20],c5,c6; /* c5 = ref_num2 , c6 = ref_num1 */
{
    struct7 *temp1;
    int *temp2,temp3[9],temp4[9];
    int i,j,k,l,nega,negb,negc,carray,carry[10];
    /*printf("entered subtract load \n");*/
    /* the process below finds out the twos complement of D */
    /* inverting of the passed argument */
    for (i=1;i<=8;i++)
    {
        if (*(c4[c5]+i) == 1)
        {
            temp3[i] = 0;
        }
        else
        {
            temp3[i] = 1;
        }
    }
    /* adding the one to form delta */
    carray = 1;
    i = 1;
    while(carray == 1)
    {
        if (temp3[9-i] == 1)
        {
            temp3[9-i] = 0;
            carray = 1;
        }
        else
        {
            temp3[9-i] = 1;
            carray = 0;
        }
        i++;
    }
    /* the twos complement is calculated */
    /*printf(" the twos complement of D \n");
    for (i=1;i<=8;i++)
    {
        printf(" the value of i = %d\n",i);
        printf(" %d\n",temp3[i]);
    }
    printf("\n");*/
    /* the below segment adds N and D's 2's complement */
    carry[0] = 0;
    carry[1] = 0;
    for (i=1;i<=8;++i)
    {
        /*printf(" the value of iteration \n");*/
        nega = 0;
        negb = 0;
        negc = 0;
        /*printf(" the value of *(c3[c5]+(9-%d)) = %d\n",i,*(c3[c5]+(9-i)));*/
        if (*(c3[c5]+(9-i)) == 0)
        {
            nega = 1;
        }
        /*printf(" the value of temp3[9-%d] \n",i);*/
        if (temp3[9-i] == 0)
        {
            negb = 1;
        }
        if (carry[i] == 0)
        {
            negc = 1;
        }
        temp4[i] = ((((*(c3[c5]+(9-i))*negb*negc)
            | (temp3[9-i]*nega*negc)) | (carry[i]*nega*negb)) |
            (*(c3[c5]+(9-i))*temp3[9-i]*carry[i]));
        /*printf(" the partial product of temp 4 with i = %d\n",i);
        printf(" %d \n",temp4[i]);*/
        carry[i+1] = ((*(c3[c5]+(9-i))*temp3[9-i])
            | (carry[i]*(*(c3[c5]+(9-i)))) | (temp3[9-i]*carry[i]));
    }
    /*printf(" the value of N - D \n");
    for (i=1;i<=8;i++)
    {
        printf(" %d ",temp4[i]);
    }
    printf("\n");*/
    /* loading of N - D into num_one */
    for(i=1;i<=8;i++)
    {
        (c1 + c6)->num_one[9-i] = temp4[i];
    }
    return (*c1);
}

/******************************************/
/****** Function compare & load ***********/
/******************************************/

struct7 compare_load(a1,a2,a3,a4,a5,a6)
struct7 *a1;
int *a2[20],*a3[20],a4,a5,*a6; /* a4 -> ref_num2 , a5 -> ref_num1 */
{
    /* a2 -> arg1; a3 -> arg2; a6 -> func */
    struct7 *temp1;
    int *temp2,*temp3,temp4,*temp5,i,j,k,l,flag_one,flag_two;
    /* the comparison of the two arguments */
    flag_one = 0;
    flag_two = 0;
    for (i=1;i<=8;i++)
    {
        if (*(a2[a4]+i) > *(a3[a4]+i))
        {
            flag_one = 1;
            i = 8;
        }
        if (*(a2[a4]+i) < *(a3[a4]+i))
        {
            flag_two = 1;
            i = 8;
        }
    }
    if(flag_two == 1)
    {
        for (i=1;i<=8;i++)
        {
            (a1+a5)->num_one[i] = *(a2[a4]+i); /* arg1 loaded into num1 */
        }
        /*printf(" the value of d is greater than n\n");*/
        (a1+a5)->func = *(a6+a4); /* function value is loaded */
        /*printf(" the value of opcode loaded in compare&load is %d\n",(a1+a5)->func);*/
        cal_delta(a1,a3,a4,a5); /* loading of delta into num2 and creating delta line */
        (a1 + a5)->over_flow = 0;
    }
    else
    {
        /*printf(" the value of n is greater than d\n");*/
        (a1+a5)->func = *(a6+a4); /* function value is loaded */
        /*printf(" the value of opcode loaded in compare&load is %d\n",(a1+a5)->func);*/
        cal_delta(a1,a3,a4,a5); /* loading of delta into num2 and creating delta line */
        (a1 + a5)->over_flow = 1;
    }
    return (*a1);
}

/******************************************/
/******* Function pre processor ***********/
/******************************************/

struct7 pre_proc(num1,num2,num3,num4)
struct7 *num1;
int *num2,*num3[20],*num4[20]; /* num2 -> function; num3 -> argument1; */
{
    /* num4 -> argument2 */
    /* intialisations */
    struct7 *temp1,*temp2;
    int *temp3,*temp4,*temp5;
    int temp_flag1,temp_flag2,temp_flag3;
    int i,j,k,l,ref_num1,ref_num2,ref_num3;
    temp1 = num1;
    /* the testing of the type of function */
    ref_num1 = 1; /* indexing pre processor structure */
    ref_num2 = 1; /* indexing the data array */
    ref_num3 = 0;
    for(ref_num2=1;ref_num2<=stk_ptr;ref_num2++)
    {
        /*printf("the value of the condition is %d\n",*(num2 + ref_num2));
        printf("the present value of ref_num1 %d \n",ref_num1);*/
        switch (*(num2 + ref_num2))
        {
        case 1:
            /*printf(" case number one \n");*/
            (num1+ ref_num1)->func = *(num2 + ref_num2);
            for(i=1;i<=8;i++)
            {
                (num1+ ref_num1)->num_one[i] = *(num3[ref_num2]+i);
                (num1+ ref_num1)->num_two[i] = *(num4[ref_num2]+i);
            }
            (num1+ ref_num1)->over_flow = 0;
            (num1+ ref_num1)->weight = 0;
            /* printing of the input stack */
            /*for(i=1;i<=ref_num1;i++)
            {
                printf(" the input stack is printed below with ref_num1 as %d \n",ref_num1);
                printf("the opcode is %d\n",(num1 + i)->func);
                printf("the value of argument one is as follows \n");
                for(j=1;j<=8;j++)
                {
                    printf(" %d ",(num1 + i)->num_one[j]);
                }
                printf("\n");
                printf("the value of argument two is as follows \n");
                for(j=1;j<=8;j++)
                {
                    printf(" %d ",(num1 + i)->num_two[j]);
                }
                printf("\n");
            }*/
            ref_num1 = ref_num1 + 1;
            /*printf(" reached break at case one \n");*/
            break;

        case 2:
            /*printf(" case number two \n");*/
            (num1+ ref_num1)->func = *(num2 + ref_num2);
            for (i=1;i<=8;i++)
            {
                (num1+ ref_num1)->num_one[i] = *(num3[ref_num2]+i);
                (num1+ ref_num1)->num_two[i] = *(num4[ref_num2]+i);
            }
            (num1+ ref_num1)->over_flow = 0;
            (num1+ ref_num1)->weight = 0;
            /*printf("reached the printing stage in case two\n");
            printf(" the present value of ref_num1 in case 2 %d \n",ref_num1);*/
            /* printing of the input stack */
            i=0;
            /*for(i=1;i<=ref_num1;++i)
            {
                printf(" the input stack is printed below with ref_num1 as %d \n",ref_num1);
                printf("the opcode is %d\n",(num1 + i)->func);
                printf("the value of argument one is as follows \n");
                for(j=1;j<=8;j++)
                {
                    printf(" %d ",(num1 + i)->num_one[j]);
                }
                printf("\n");
                printf("the value of argument two is as follows \n");
                for(j=1;j<=8;j++)
                {
                    printf(" %d ",(num1 + i)->num_two[j]);
                }
                printf("\n");
            }*/
            ref_num1 = ref_num1 + 1;
            break;

        case 3:
            /*printf(" case number three \n");*/
            (num1+ ref_num1)->func = *(num2 + ref_num2);
            for(i=1;i<=8;i++)
            {
                (num1+ ref_num1)->num_one[i] = *(num3[ref_num2]+i);
                (num1+ ref_num1)->num_two[i] = *(num4[ref_num2]+i);
            }
            (num1+ ref_num1)->over_flow = 0;
            (num1+ ref_num1)->weight = 0;
            /* printing of the input stack */
            /*for(i=1;i<=ref_num1;i++)
            {
                printf(" the input stack is printed below with ref_num1 as %d \n",ref_num1);
                printf("the opcode is %d\n",(num1 + i)->func);
                printf("the value of argument one is as follows \n");
                for(j=1;j<=8;j++)
                {
                    printf(" %d ",(num1 + i)->num_one[j]);
                }
                printf("\n");
                printf("the value of argument two is as follows \n");
                for(j=1;j<=8;j++)
                {
                    printf(" %d ",(num1 + i)->num_two[j]);
                }
                printf("\n");
            }*/
            ref_num1 = ref_num1 + 1;
            break;

        case 4:
            /* the division case */
            /*printf(" case number four \n");
            printf(" entering compare load \n");*/
            compare_load(num1,num3,num4,ref_num2,ref_num1,num2);
            if((num1 + ref_num1)->over_flow == 1)
            {
                subtract_load(num1,num2,num3,num4,ref_num2,ref_num1);
            }
            /* printing of the input stack */
            /*for(i=1;i<=ref_num1;i++)
            {
                printf(" the input stack is printed below with ref_num1 as %d \n",ref_num1);
                printf("the opcode is %d\n",(num1 + i)->func);
                printf("the value of argument one is as follows \n");
                for(j=1;j<=8;j++)
                {
                    printf(" %d ",(num1 + i)->num_one[j]);
                }
                printf("\n");
                printf("the value of argument two is as follows \n");
                for(j=1;j<=8;j++)
                {
                    printf(" %d ",(num1 + i)->num_two[j]);
                }
                printf("\n");
            }*/
            ref_num1 = ref_num1 + 2;
            break;
        }
    }
    return (*num1);
}
void print_outstack(dum1)
struct8 *dum1;
{
    struct8 *temp1;
    int i,j,k,l;
    temp1 = dum1;
    for(i=0;i<=20;i++)
    {
        printf("\n");
        printf("\n");
        printf(" the original I.S number \n");
        printf(" %d \n",(temp1+i)->destination);
        printf("\n");
        printf("\n");
        printf(" The result of the instruction (16 - 0)\n");
        printf("\n");
        printf("\n");
        printf("\n");
        for(j=0;j<=16;j++)
        {
            printf(" %d ",(temp1+i)->result[16-j]);
        }
        printf("\n");
        printf("\n");
        printf("\n");
    }
}
void print_psstack(dum1)
struct9 *dum1;
{
    struct9 *temp1;
    int i,j,k,l;
    temp1 = dum1;
    for(i=0;i<=20;i++)
    {
        printf("\n");
        printf("\n");
        printf(" the tracking register number \n");
        printf(" %d \n",(temp1+i)->address);
        printf("\n");
        printf("\n");
        printf(" the function number \n");
        printf(" %d \n",(temp1+i)->func);
        printf("\n");
        printf("\n");
        printf(" The result of the num_one (0 - 8)\n");
        printf("\n");
        printf("\n");
        printf("\n");
        for(j=0;j<=8;j++)
        {
            printf(" %d ",(temp1+i)->num_one[j]);
        }
        printf("\n");
        printf("\n");
        printf(" The result of the num_two (0 - 8)\n");
        printf("\n");
        printf("\n");
        for(j=0;j<=8;j++)
        {
            printf(" %d ",(temp1+i)->num_two[j]);
        }
        printf("\n");
        printf("\n");
        printf("\n");
        printf("\n");
    }
}

.........................
..........................................

/******* Function Output Check **********/


..........................................

..........................
struct8 output_check(num0,num1,num2,num3,num4,num5,num6,num7,num8,num9)
struct2 *num0;   /* pointer to trans_pointer */
struct9 *num1;   /* pointer to priority stack */
struct8 *num2;   /* pointer to output structure */
struct3 *num3;   /* pointer to div trac */
struct4 *num4;   /* pointer to mult trac */
struct5 *num5;   /* pointer to add trac */
struct6 *num6;   /* pointer to logg sheet */
struct5 *num7;   /* pointer to sub track */
struct3 *num8;   /* pointer to delta track */
struct1 *num9;   /* pointer to multi-purpose registers */
{
    struct9 *temp1;   /* pointer to priority registers */
    struct2 *temp0;   /* pointer to trans_pointer */
    struct8 *temp2;   /* pointer to output structure */
    struct3 *temp3;   /* pointer to div trac */
    struct4 *temp4;   /* pointer to mult trac */
    struct5 *temp5;   /* pointer to add trac */
    struct6 *temp6;   /* pointer to logg sheet */
    struct5 *temp7;   /* pointer to sub track */
    struct3 *temp8;   /* pointer to delta track */
    struct1 *temp9;   /* pointer to mpreg */
    int i, j, k, l, remainder, local_index, future_index, get_out;

    temp0 = num0;
    temp2 = num2;
    temp3 = num3;
    temp4 = num4;
    temp5 = num5;
    temp6 = num6;
    temp7 = num7;
    temp8 = num8;
    temp9 = num9;
    temp1 = num1;
    /* checking of add trac */
    if ((temp6+1)->logg_stat == 1)
    {
        printf(" add output_check is engaged \n");
        for (i = 1; i <= 9; i++)
        {
            if ((temp6+1)->logg[i] == 1)
            {
                if ((temp5+i)->st_track[1] == 1)
                {
                    printf(" the output of add is being loaded\n");
                    k = (temp5+i)->address;
                    for (j = 0; j <= 16; j++)
                    {
                        (temp2+k)->result[j] = (temp0+20)->word[j];
                    }
                    for (j = 0; j <= 6; j++)
                    {
                        (temp5+i)->st_track[j] = 0;
                    }
                    (temp6+1)->logg[i] = 0;
                }
            }
        }
    }
    printf(" THE OUTPUT STACK AFTER LOADING ADDITION\n");
    print_outstack(num2);

    /* checking of sub trac */
    if ((temp6+2)->logg_stat == 1)
    {
        printf(" sub output_check is engaged \n");
        for (i = 1; i <= 9; i++)
        {
            if ((temp6+2)->logg[i] == 1)
            {
                if ((temp7+i)->st_track[1] == 1)
                {
                    printf(" the output of sub is being loaded\n");
                    k = (temp7+i)->address;
                    for (j = 0; j <= 16; j++)
                    {
                        (temp2+k)->result[j] = (temp0+20)->word[j];
                    }
                    for (j = 0; j <= 6; j++)
                    {
                        (temp7+i)->st_track[j] = 0;
                    }
                    (temp6+2)->logg[i] = 0;
                }
            }
        }
    }
    printf(" THE OUTPUT STACK AFTER LOADING SUBTRACTION\n");
    print_outstack(num2);

    /* checking of mult trac */
    if ((temp6+3)->logg_stat == 1)
    {
        printf(" mult output_check is engaged \n");
        for (i = 1; i <= 9; i++)
        {
            if ((temp6+3)->logg[i] == 1)
            {
                if ((temp4+i)->st_track[6] == 1)
                {
                    printf(" the output of mult is being loaded\n");
                    k = (temp4+i)->address;
                    for (j = 0; j <= 16; j++)
                    {
                        (temp2+k)->result[j] = (temp0+20)->word[j];
                    }
                    for (j = 0; j <= 8; j++)
                    {
                        (temp4+i)->st_track[j] = 0;
                    }
                    (temp6+3)->logg[i] = 0;
                }
            }
        }
    }
    printf(" THE OUTPUT STACK AFTER LOADING MULTIPLICATION \n");
    print_outstack(num2);

    /* checking of div trac */
    if ((temp6+4)->logg_stat == 1)
    {
        printf(" div output_check is engaged \n");
        for (i = 1; i <= 9; i++)
        {
            if ((temp6+4)->logg[i] == 1)
            {
                /* Division is present and the result */
                /* is ready to be iterated or sent to */
                /* priority stack. Checking for delta */
                /* or for the iteration number in the div registers */
                /* calculating the remainder */
                get_out = 0;
                remainder = 0;
                for (j = 8; j <= 15; j++)
                {
                    remainder = remainder + (temp0+20)->word[j];
                }
                if (((temp3+i)->itr_track == 4) || (remainder == 0))
                {
                    printf(" the output of div is being loaded\n");
                    (temp3+i)->itr_track = 0;
                    k = (temp3+i)->address;
                    for (j = 0; j <= 16; j++)
                    {
                        (temp2+k)->result[j] = (temp0+21)->word[j];
                    }
                    for (j = 0; j <= 8; j++)
                    {
                        (temp3+i)->st_track[j] = 0;
                        (temp8+i)->st_track[j] = 0;
                    }
                    (temp6+4)->logg[i] = 0;
                    (temp6+5)->logg[i] = 0;
                    (temp3+i)->itr_track = 0;
                    (temp8+i)->itr_track = 0;
                    get_out = 1;
                }
                if (get_out == 0)
                {
                    /* the iteration has to be carried out */
                    printf(" the data is going to be stored in P.S\n");
                    /* loading of data in priority stack */
                    local_index = (temp9 + 0)->bits[9];
                    future_index = local_index + 1;
                    /* loading of data */
                    for (j = 0; j <= 7; j++)
                    {
                        (temp1+local_index)->num_one[j] = (temp0+21)->word[15-j];
                        (temp1+local_index)->num_two[j+1] = (temp0+20)->word[15-j];
                        (temp1+future_index)->num_one[j] = (temp0+20)->word[15-j];
                        (temp1+future_index)->num_two[j+1] = (temp0+20)->word[15-j];
                    }
                    (temp9 + 0)->bits[9] = future_index + 1;
                    (temp1+local_index)->num_two[0] = 1;
                    /* setting of priority flag */
                    if ((temp9+0)->bits[9] == (temp9+0)->bits[0])
                    {
                        printf(" the priority flag is set to 1 \n");
                        (temp9+0)->bits[1] = 1;
                    }
                    /* Initialising the track register */
                    (temp1+local_index)->address = i;
                    (temp1+future_index)->address = i;
                    (temp1+local_index)->func = 4;
                    (temp1+future_index)->func = 5;
                    /* initialising the registers to zero */
                    for (j = 0; j <= 8; j++)
                    {
                        (temp3+i)->st_track[j] = 0;
                        (temp8+i)->st_track[j] = 0;
                    }
                    printf(" the priority stack is printed below \n");
                    print_psstack(temp1);
                }
            }
        }
    }
    printf(" THE OUTPUT STACK AFTER LOADING DIVISION \n");
    print_outstack(num2);
    return (*num2);
}

/******************************************/
/*******  Function Shift Track  ***********/
/******************************************/

void shift_track(num1,num2,num3,num4,num5,num6)
struct3 *num1, *num6;   /* pointer to div trac and delta track */
struct4 *num2;          /* pointer to mult trac */
struct5 *num3, *num5;   /* pointer to add trac and sub track */
struct6 *num4;          /* pointer to logg sheet */
{
    struct3 *temp1;   /* pointer to div track */
    struct4 *temp2;   /* pointer to mult track */
    struct5 *temp3;   /* pointer to add track */
    struct6 *temp4;   /* pointer to logg sheet */
    struct5 *temp5;   /* pointer to sub track */
    struct3 *temp6;   /* pointer to delta track */
    int i, j, k, l;

    temp1 = num1;
    temp2 = num2;
    temp3 = num3;
    temp4 = num4;
    temp5 = num5;
    temp6 = num6;

    /* shifting of add trac */
    if ((temp4+1)->logg_stat == 1)
    {
        printf(" add track is engaged to be shifted\n");
        for (i = 1; i <= 9; i++)
        {
            if ((temp4+1)->logg[i] == 1)
            {
                for (j = 1; j <= 8; j++)
                {
                    if ((temp3+i)->st_track[j] == 1)
                    {
                        (temp3+i)->st_track[j+1] = 1;
                        (temp3+i)->st_track[j] = 0;
                        j = 9;
                    }
                }
            }
        }
    }

    /* shifting of sub trac */
    if ((temp4+2)->logg_stat == 1)
    {
        printf(" sub track is engaged to be shifted\n");
        for (i = 1; i <= 9; i++)
        {
            if ((temp4+2)->logg[i] == 1)
            {
                for (j = 1; j <= 8; j++)
                {
                    if ((temp5+i)->st_track[j] == 1)
                    {
                        (temp5+i)->st_track[j+1] = 1;
                        (temp5+i)->st_track[j] = 0;
                        j = 9;
                    }
                }
            }
        }
    }

    /* shifting of mult trac */
    if ((temp4+3)->logg_stat == 1)
    {
        printf(" mult track is engaged to be shifted\n");
        for (i = 1; i <= 9; i++)
        {
            if ((temp4+3)->logg[i] == 1)
            {
                for (j = 1; j <= 8; j++)
                {
                    if ((temp2+i)->st_track[j] == 1)
                    {
                        (temp2+i)->st_track[j+1] = 1;
                        (temp2+i)->st_track[j] = 0;
                        j = 9;
                    }
                }
            }
        }
    }

    /* shifting of div trac */
    if ((temp4+4)->logg_stat == 1)
    {
        printf(" div track is engaged to be shifted\n");
        for (i = 1; i <= 9; i++)
        {
            if ((temp4+4)->logg[i] == 1)
            {
                for (j = 1; j <= 8; j++)
                {
                    if ((temp1+i)->st_track[j] == 1)
                    {
                        (temp1+i)->st_track[j+1] = 1;
                        (temp1+i)->st_track[j] = 0;
                        j = 9;
                    }
                }
            }
        }
    }

    /* shifting of delta trac */
    if ((temp4+5)->logg_stat == 1)
    {
        printf(" delta track is engaged to be shifted\n");
        for (i = 1; i <= 9; i++)
        {
            if ((temp4+5)->logg[i] == 1)
            {
                for (j = 1; j <= 8; j++)
                {
                    if ((temp6+i)->st_track[j] == 1)
                    {
                        (temp6+i)->st_track[j+1] = 1;
                        (temp6+i)->st_track[j] = 0;
                        j = 9;
                    }
                }
            }
        }
    }
}
void status_print1(num1,num2,num3)
struct1 *num1;
struct5 *num2;
struct6 *num3;
{
    struct1 *dum1;
    struct5 *dum2;
    struct6 *dum3;
    int i, j, k, l;

    dum1 = num1;
    dum2 = num2;
    dum3 = num3;
    printf("printing the pipeline register and flag register and tracking registers and status logg\n");
    printf("\n");
    printf("\n");
    printf(" the input registers ( 8 - 0 )\n");
    printf("\n");
    printf("\n");
    printf(" the input register mpreg + 1 \n");
    for (j = 0; j <= 7; j++)
    {
        printf(" %d ", (dum1+1)->bits[7-j]);
    }
    printf("\n");
    printf("\n");
    printf(" the input register mpreg + 2 \n");
    for (j = 0; j <= 7; j++)
    {
        printf(" %d ", (dum1+2)->bits[7-j]);
    }
    printf("\n");
    printf("\n");
    printf(" the flag register \n");
    printf("\n");
    printf("\n");
    printf(" the flag register is mpreg + 0 \n");
    for (i = 0; i <= 10; i++)
    {
        printf(" %d ", (dum1+0)->bits[10-i]);
    }
    printf("\n");
    printf("\n");
    printf(" the logging register for add\n");
    for (i = 1; i <= 9; i++)
    {
        printf(" %d ", (dum3+1)->logg[9-i]);
    }
    printf("\n");
    printf("\n");
    printf(" the tracking registers \n");
    for (i = 1; i <= 9; i++)
    {
        if ((dum3+1)->logg[i] == 1)
        {
            printf(" the tracking register number is %d and the value of the address is %d \n", i, (dum2+i)->address);
            printf("\n");
            for (j = 1; j <= 8; j++)
            {
                printf(" %d ", (dum2+i)->st_track[8-j]);
            }
            printf("\n");
            printf("\n");
        }
    }
}

void status_print2(num1,num2,num3)
struct1 *num1;
struct5 *num2;
struct6 *num3;
{
    struct1 *dum1;
    struct5 *dum2;
    struct6 *dum3;
    int i, j, k, l;

    dum1 = num1;
    dum2 = num2;
    dum3 = num3;
    printf("printing the pipeline register and flag register and tracking registers and status logg\n");
    printf("\n");
    printf("\n");
    printf(" the input registers ( 8 - 0 )\n");
    printf("\n");
    printf("\n");
    printf(" the input register mpreg + 1 \n");
    for (j = 0; j <= 7; j++)
    {
        printf(" %d ", (dum1+1)->bits[7-j]);
    }
    printf("\n");
    printf("\n");
    printf(" the input register mpreg + 2 \n");
    for (j = 0; j <= 7; j++)
    {
        printf(" %d ", (dum1+2)->bits[7-j]);
    }
    printf("\n");
    printf("\n");
    printf(" the flag register \n");
    printf("\n");
    printf("\n");
    printf(" the flag register is mpreg + 0 \n");
    for (i = 0; i <= 10; i++)
    {
        printf(" %d ", (dum1+0)->bits[10-i]);
    }
    printf("\n");
    printf("\n");
    printf(" the logging register for sub\n");
    for (i = 1; i <= 9; i++)
    {
        printf(" %d ", (dum3+2)->logg[9-i]);
    }
    printf("\n");
    printf("\n");
    printf(" the tracking registers \n");
    for (i = 1; i <= 9; i++)
    {
        if ((dum3+2)->logg[i] == 1)
        {
            printf(" the tracking register number is %d and the value of the address is %d \n", i, (dum2+i)->address);
            printf("\n");
            for (j = 0; j <= 8; j++)
            {
                printf(" %d ", (dum2+i)->st_track[8-j]);
            }
            printf("\n");
            printf("\n");
        }
    }
}

void status_print3(num1,num2,num3)
struct1 *num1;
struct4 *num2;
struct6 *num3;
{
    struct1 *dum1;
    struct4 *dum2;
    struct6 *dum3;
    int i, j, k, l;

    dum1 = num1;
    dum2 = num2;
    dum3 = num3;
    printf("printing the pipeline register and flag register and tracking registers and status logg\n");
    printf("\n");
    printf("\n");
    printf(" the input registers ( 8 - 0 )\n");
    printf("\n");
    printf("\n");
    printf(" the input register mpreg + 3 \n");
    for (j = 0; j <= 8; j++)
    {
        printf(" %d ", (dum1+3)->bits[8-j]);
    }
    printf("\n");
    printf("\n");
    printf(" the input register mpreg + 4 \n");
    for (j = 0; j <= 8; j++)
    {
        printf(" %d ", (dum1+4)->bits[8-j]);
    }
    printf("\n");
    printf("\n");
    printf(" the flag register \n");
    printf("\n");
    printf("\n");
    printf(" the flag register is mpreg + 0 \n");
    for (i = 0; i <= 10; i++)
    {
        printf(" %d ", (dum1+0)->bits[10-i]);
    }
    printf("\n");
    printf("\n");
    printf(" the logging register for mult\n");
    for (i = 1; i <= 9; i++)
    {
        printf(" %d ", (dum3+3)->logg[9-i]);
    }
    printf("\n");
    printf("\n");
    printf(" the tracking registers \n");
    for (i = 1; i <= 9; i++)
    {
        if ((dum3+3)->logg[i] == 1)
        {
            printf(" the tracking register number is %d and the value of the address is %d \n", i, (dum2+i)->address);
            printf("\n");
            for (j = 1; j <= 8; j++)
            {
                printf(" %d ", (dum2+i)->st_track[8-j]);
            }
            printf("\n");
            printf("\n");
        }
    }
}

void status_print4(num1,num2,num3)
struct1 *num1;
struct3 *num2;
struct6 *num3;
{
    struct1 *dum1;
    struct3 *dum2;
    struct6 *dum3;
    int i, j, k, l;

    dum1 = num1;
    dum2 = num2;
    dum3 = num3;
    printf("printing the pipeline register and flag register and tracking registers and status logg\n");
    printf("\n");
    printf("\n");
    printf(" the input registers ( 8 - 0 )\n");
    printf("\n");
    printf("\n");
    printf(" the input register mpreg + 3 \n");
    for (j = 0; j <= 8; j++)
    {
        printf(" %d ", (dum1+3)->bits[8-j]);
    }
    printf("\n");
    printf("\n");
    printf(" the input register mpreg + 4 \n");
    for (j = 0; j <= 8; j++)
    {
        printf(" %d ", (dum1+4)->bits[8-j]);
    }
    printf("\n");
    printf("\n");
    printf(" the flag register \n");
    printf("\n");
    printf("\n");
    printf(" the flag register is mpreg + 0 \n");
    for (i = 0; i <= 10; i++)
    {
        printf(" %d ", (dum1+0)->bits[10-i]);
    }
    printf("\n");
    printf("\n");
    printf(" the logging register for div\n");
    for (i = 1; i <= 9; i++)
    {
        printf(" %d ", (dum3+4)->logg[9-i]);
    }
    printf("\n");
    printf("\n");
    printf(" the tracking registers \n");
    for (i = 1; i <= 9; i++)
    {
        if ((dum3+4)->logg[i] == 1)
        {
            printf(" the tracking register number is %d and the value of the address is %d \n", i, (dum2+i)->address);
            printf("\n");
            for (j = 1; j <= 8; j++)
            {
                printf(" %d ", (dum2+i)->st_track[8-j]);
            }
            printf("\n");
            printf("\n");
        }
    }
}
/******************************************/
/*******  Function Load Pipeline  *********/
/******************************************/

/***********************************/
/***** 0. P.F indicator.    ********/
/***** 1. Priority Flag.    ********/
/***** 2. Stack Index.      ********/
/***** 3. CCM Pointer.      ********/
/***** 4. ADD Latency.      ********/
/***** 5. SUB Latency.      ********/
/***** 6. MULT Latency.     ********/
/***** 7. DIV Latency.      ********/
/***** 8. Priority Index.   ********/
/***** 9. Local Index.      ********/
/***********************************/
struct1 load_pipeline(num0,num1,num2,num3,num4,num5,num6,num7,num8,num9,num10,num11,num12)
struct1 *num0;    /* pointer to input registers */
struct0 *num1;    /* pointer to cross collision matrices */
struct7 *num2;    /* pointer to input structure */
struct3 *num3;    /* pointer to div trac */
struct4 *num4;    /* pointer to mult trac */
struct5 *num5;    /* pointer to add trac */
struct6 *num6;    /* pointer to logg sheet */
struct9 *num10;   /* pointer to priority structure */
struct5 *num11;   /* pointer to subtract trac */
struct3 *num12;   /* pointer to delta trac */
int num7, num8, num9;   /* registers */
{
    struct1 *temp0;    /* pointer to input registers */
    struct0 *temp1;    /* pointer to cross collision matrices */
    struct7 *temp2;    /* pointer to input structure */
    struct3 *temp3;    /* pointer to div trac */
    struct4 *temp4;    /* pointer to mult trac */
    struct5 *temp5;    /* pointer to add trac */
    struct6 *temp6;    /* pointer to logg sheet */
    struct9 *temp10;   /* pointer to priority structure */
    struct5 *temp11;   /* pointer to subtract trac */
    struct3 *temp12;   /* pointer to delta trac */
    int i, j, k, l, priority_flag, stack_index, matrix_index, look_ahead;
    int priority_index, additional_entry, divisional_entry, dis;

    init_key = 0;
    delta_flag = 0;
    addition = 0;
    subtraction = 0;
    multiplication = 0;
    division = 0;
    temp0 = num0;
    temp1 = num1;
    temp2 = num2;
    temp3 = num3;
    temp4 = num4;
    temp5 = num5;
    temp6 = num6;
    temp10 = num10;
    temp11 = num11;
    temp12 = num12;
    priority_flag = (num0+0)->bits[1];   /* loading the priority flag */
    stack_index = (num0+0)->bits[2];     /* loading the current instruction location */
    matrix_index = (num0+0)->bits[3];    /* loading the current address of CCM */
    priority_index = (temp0+0)->bits[8];
    look_ahead = stack_index + 1;
    /* checking for any priority situation */
    if (priority_flag == 1)
    {
        printf(" the priority flag is one and entering case fnc\n");
        switch ((temp10 + priority_index)->func)
        {
        case 4:
            if ((temp1+matrix_index)->smatrix.bits_row1[(temp0+0)->bits[10]] == 0)
            {
                /* in here it will be determined whether div */
                /* can be added to pipe with add or sub */
                printf(" div is possible and checking to see if additional functions are possible and main case is 4 \n");
                switch ((temp2+stack_index)->func)
                {
                case 1:
                    if ((temp1+matrix_index)->smatrix.bits_row3[(temp0+0)->bits[10]] == 0)
                    {
                        divisional_entry = 1;
                        printf(" divisional entry is 1\n");
                        printf(" addition is also possible\n");
                    }
                    else
                    {
                        divisional_entry = 0;
                        printf(" though addition is the next instruction no latency is available \n");
                        printf(" divisional entry is 0\n");
                    }
                    break;
                case 2:
                    if ((temp1+matrix_index)->smatrix.bits_row3[(temp0+0)->bits[10]] == 0)
                    {
                        divisional_entry = 2;
                        printf(" divisional entry is 2\n");
                        printf(" subtraction is also possible\n");
                    }
                    else
                    {
                        divisional_entry = 0;
                        printf(" though subtraction is the next instruction no latency is available \n");
                        printf(" divisional entry is 0\n");
                    }
                    break;
                default:
                    divisional_entry = 0;
                    printf(" only division is possible \n");
                    printf(" divisional entry is 0\n");
                    break;
                }
            }
            else
            {
                printf("no latency available to process p.s\n");
                (temp0+0)->bits[10] += 1;
                printf("the next latency is %d\n", (temp0+0)->bits[10]);
                printf(" initialising the input registers to 0\n");
                init_key = 1;
                for (i = 0; i <= 8; i++)
                {
                    (temp0+1)->bits[i] = 0;
                    (temp0+2)->bits[i] = 0;
                    (temp0+3)->bits[i] = 0;
                    (temp0+4)->bits[i] = 0;
                }
            }
            break;
        case 5:
            printf(" the case is 5 and delta is being loaded wherein the priority flag is 1 \n");
            for (j = 0; j <= 7; j++)
            {
                (temp0+4)->bits[j] = (temp10+priority_index)->num_two[j];
            }
            /* this part will initiate the tracking registers */
            dis = (temp10 + priority_index)->address;
            (temp12 + dis)->st_track[1] = 1;
            /* re initialising the priority index */
            (temp0+0)->bits[8] += 1;
            divisional_entry = 4;
            delta_flag = 1;
            /* checking and initialising the priority flag */
            printf(" the priority flag is init to 0 \n");
            (temp0+0)->bits[1] = 0;
            /* setting of priority index */
            (temp0+0)->bits[0] += 2;
            /* printing of the results of case 5 */
            printf("printing the pipeline register and flag register\n");
            printf("\n");
            printf("\n");
            printf(" the input registers temp0 + 3 ( 7 - 0 )\n");
            printf("\n");
            printf("\n");
            for (j = 0; j <= 7; j++)
            {
                printf(" %d ", (temp0+3)->bits[7-j]);
            }
            printf("\n");
            printf("\n");
            printf(" the input registers temp0 + 4 ( 8 - 0 )\n");
            printf("\n");
            printf("\n");
            for (j = 0; j <= 8; j++)
            {
                printf(" %d ", (temp0+4)->bits[8-j]);
            }
            printf("\n");
            printf("\n");
            printf(" the flag register 8 - 0\n");
            printf("\n");
            printf("\n");
            for (i = 0; i <= 8; i++)
            {
                printf(" %d ", (temp0+0)->bits[8-i]);
            }
            printf("\n");
            printf("\n");
            printf(" the logging registers 9 - 0\n");
            printf("\n");
            printf("\n");
            for (i = 1; i <= 9; i++)
            {
                printf(" %d ", (temp6+5)->logg[9-i]);
            }
            printf("\n");
            printf("\n");
            printf(" the tracking registers \n");
            printf("\n");
            printf("\n");
            for (i = 1; i <= 9; i++)
            {
                if ((temp6+5)->logg[i] == 1)
                {
                    printf(" the tracking register number is %d and the value of the address is %d \n", i, (temp12+i)->address);
                    printf("\n");
                    printf("\n");
                    for (j = 0; j <= 8; j++)
                    {
                        printf(" %d ", (temp12+i)->st_track[8-j]);
                    }
                }
            }
            printf("\n");
            printf("\n");
            break;
        }
        /* this section below assigns the values for case 4 */
        switch (divisional_entry)
        {
        case 0:
            /* the division is being loaded */
            printf(" the latency is available for iteration\n");
            /* loading the arguments into the stage div */
            for (j = 0; j <= 7; j++)
            {
                (temp0+3)->bits[j] = (temp10+priority_index)->num_one[j];
            }
            for (j = 0; j <= 8; j++)
            {
                (temp0+4)->bits[j] = (temp10+priority_index)->num_two[j];
            }
            (temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_latency[(temp0+0)->bits[10]];
            /* this part will initiate the tracking registers */
            dis = (temp10 + priority_index)->address;
            (temp3 + dis)->st_track[1] = 1;
            (temp0+0)->bits[10] = 0;
            (temp0+0)->bits[8] += 1;
            printf(" the division status is printed below in case 0 in iteration \n");
            status_print4(temp0,temp3,temp6);
            division = 1;
            break;
        case 1:
            /* the addition is being loaded */
            printf(" the latency is available for iteration\n");
            /* loading the arguments into the stage add */
            for (j = 0; j <= 7; j++)
            {
                (temp0+1)->bits[j] = (temp2+stack_index)->num_one[j+1];
                (temp0+2)->bits[j] = (temp2+stack_index)->num_two[j+1];
            }
            (temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_add[(temp0+0)->bits[10]];
            (temp0+0)->bits[2] += 1;
            addition = 1;
            /* this part will initiate the tracking registers */
            for (i = 1; i <= 9; i++)
            {
                if ((temp6+1)->logg[i] == 0)
                {
                    (temp6+1)->logg[i] = 1;
                    (temp5+i)->st_track[1] = 1;
                    (temp5+i)->address = (temp2+look_ahead)->location;
                    i = 10;
                }
            }
            printf(" the addition status is printed below in case 1 in iteration\n");
            status_print1(temp0,temp5,temp6);
            /* the division is being loaded */
            printf(" the latency is available in iteration \n");
            /* loading the arguments into the stage div */
            for (j = 0; j <= 7; j++)
            {
                (temp0+3)->bits[j] = (temp10+priority_index)->num_one[j];
            }
            for (j = 0; j <= 8; j++)
            {
                (temp0+4)->bits[j] = (temp10+priority_index)->num_two[j];
            }
            (temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_add[(temp0+0)->bits[10]];
            /* this part will initiate the tracking registers */
            dis = (temp10 + priority_index)->address;
            (temp3 + dis)->st_track[1] = 1;
            (temp0+0)->bits[10] = 0;
            (temp0+0)->bits[8] += 1;
            printf(" the division status is printed below in case 1 in iteration\n");
            status_print4(temp0,temp3,temp6);
            division = 1;
            break;
        case 2:
            /* the subtraction is being loaded */
            printf(" the latency is available \n");
            /* loading the arguments into the stage sub */
            for (j = 0; j <= 7; j++)
            {
                (temp0+1)->bits[j] = (temp2+stack_index)->num_one[j+1];
                (temp0+2)->bits[j] = (temp2+stack_index)->num_two[j+1];
            }
            (temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_add[(temp0+0)->bits[10]];
            (temp0+0)->bits[2] += 1;
            subtraction = 1;
            /* this part will initiate the tracking registers */
            for (i = 1; i <= 9; i++)
            {
                if ((temp6+2)->logg[i] == 0)
                {
                    (temp6+2)->logg[i] = 1;
                    (temp11+i)->st_track[1] = 1;
                    (temp11+i)->address = (temp2+look_ahead)->location;
                    i = 10;
                }
            }
            printf(" the subtraction status is printed below in case 2 in iteration\n");
            status_print2(temp0,temp11,temp6);
            /* the division is being loaded */
            printf(" the latency is available.\n");
            /* loading the arguments into the stage div */
            for (j = 0; j <= 7; j++)
            {
                (temp0+3)->bits[j] = (temp10+priority_index)->num_one[j];
            }
            for (j = 0; j <= 8; j++)
            {
                (temp0+4)->bits[j] = (temp10+priority_index)->num_two[j];
            }
            (temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_add[(temp0+0)->bits[10]];
            /* this part will initiate the tracking registers */
            dis = (temp10 + priority_index)->address;
            (temp3 + dis)->st_track[1] = 1;
            (temp0+0)->bits[10] = 0;
            (temp0+0)->bits[8] += 1;
            printf(" the division status is printed below in case 12\n");
            status_print4(temp0,temp3,temp6);
            division = 1;
            break;
        case 4:
            break;
        }
    }
    else
    {
        /* This condition indicates no division is awaiting */
        /* any iterations                                   */
        /* checking for addition or subtraction             */
        /* now checking for the type of loading             */
        /* case 1  -> add only                  */
        /* case 2  -> add and multiplication    */
        /* case 3  -> add and division          */
        /* case 4  -> sub only                  */
        /* case 5  -> sub and multiplication    */
        /* case 6  -> sub and division          */
        /* case 7  -> mult only                 */
        /* case 8  -> mult and addition         */
        /* case 9  -> mult and subtraction      */
        /* case 10 -> div only                  */
        /* case 11 -> div and addition          */
        /* case 12 -> div and subtraction       */
        printf(" priority flag is zero and entering case fnc\n");
        switch ((temp2+stack_index)->func)
        {

        case 1:
            /* in here it will be determined whether add */
            /* can be added to pipe with mult or div */
            if ((temp1+matrix_index)->smatrix.bits_row3[(temp0+0)->bits[4]] == 0)
            {
                printf(" add is possible and checking to see if additional functions are possible and main case is 1 \n");
                switch ((temp2+look_ahead)->func)
                {
                case 3:
                    if ((temp1+matrix_index)->smatrix.bits_row2[(temp0+0)->bits[4]] == 0)
                    {
                        additional_entry = 2;
                        printf(" additional entry is 2\n");
                        printf(" multiplication is also possible\n");
                    }
                    else
                    {
                        additional_entry = 1;
                        printf(" additional entry is 1\n");
                        printf(" though multiplication is the next instruction no latency is available \n");
                    }
                    break;
                case 4:
                    if ((temp1+matrix_index)->smatrix.bits_row1[(temp0+0)->bits[4]] == 0)
                    {
                        additional_entry = 3;
                        printf(" additional entry is 3\n");
                        printf(" division is also possible\n");
                    }
                    else
                    {
                        additional_entry = 1;
                        printf(" additional entry is 1\n");
                        printf(" though division is the next instruction no latency is available \n");
                    }
                    break;
                default:
                    additional_entry = 1;
                    printf(" only addition is possible \n");
                    break;
                }
            }
            else
            {
                printf(" no latency is available for addition\n");
                (temp0+0)->bits[4] += 1;
                (temp0+0)->bits[10] += 1;
                printf(" the next latency to look for is %d \n", (temp0+0)->bits[4]);
                additional_entry = 0;
                if ((temp0+0)->bits[4] == 8)
                {
                    printf(" no latency is available and reinitialising to matrix 3 \n");
                    (temp0+0)->bits[4] = 0;
                    (temp0+0)->bits[3] = 3;
                }
            }
            break;
        case 2:
            /* in here it will be determined whether sub */
            /* can be added to pipe with mult or div */
            if ((temp1+matrix_index)->smatrix.bits_row3[(temp0+0)->bits[5]] == 0)
            {
                printf(" sub is possible and checking to see if additional functions are possible and main case is 2 \n");
                switch ((temp2+look_ahead)->func)
                {
                case 3:
                    if ((temp1+matrix_index)->smatrix.bits_row2[(temp0+0)->bits[5]] == 0)
                    {
                        additional_entry = 5;
                        printf(" additional entry is 5\n");
                        printf(" multiplication is also possible\n");
                    }
                    else
                    {
                        additional_entry = 4;
                        printf(" additional entry is 4\n");
                        printf(" though multiplication is the next instruction no latency is available \n");
                    }
                    break;
                case 4:
                    if ((temp1+matrix_index)->smatrix.bits_row1[(temp0+0)->bits[5]] == 0)
                    {
                        additional_entry = 6;
                        printf(" additional entry is 6\n");
                        printf(" division is also possible\n");
                    }
                    else
                    {
                        additional_entry = 4;
                        printf(" additional entry is 4\n");
                        printf(" though division is the next instruction no latency is available \n");
                    }
                    break;
                default:
                    additional_entry = 4;
                    printf(" only subtraction is possible \n");
                    printf(" additional entry is 4\n");
                    break;
                }
            }
            else
            {
                printf(" no latency is available for subtraction\n");
                (temp0+0)->bits[5] += 1;
                (temp0+0)->bits[10] += 1;
                printf(" the next latency to look for is %d \n", (temp0+0)->bits[5]);
                additional_entry = 0;
                if ((temp0+0)->bits[5] == 8)
                {
                    printf(" no latency is available and reinitialising to matrix 3 \n");
                    (temp0+0)->bits[5] = 0;
                    (temp0+0)->bits[3] = 3;
                }
            }
            break;
        case 3:
            /* in here it will be determined whether mult */
            /* can be added to pipe with add or sub */
            if ((temp1+matrix_index)->smatrix.bits_row2[(temp0+0)->bits[6]] == 0)
            {
                printf(" mult is possible and checking to see if additional functions are possible and main case is 3 \n");
                switch ((temp2+look_ahead)->func)
                {
                case 1:
                    if ((temp1+matrix_index)->smatrix.bits_row3[(temp0+0)->bits[6]] == 0)
                    {
                        additional_entry = 8;
                        printf(" additional entry is 8\n");
                        printf(" addition is also possible\n");
                    }
                    else
                    {
                        additional_entry = 7;
                        printf(" though addition is the next instruction no latency is available \n");
                        printf(" additional entry is 7\n");
                    }
                    break;
                case 2:
                    if ((temp1+matrix_index)->smatrix.bits_row3[(temp0+0)->bits[6]] == 0)
                    {
                        additional_entry = 9;
                        printf(" additional entry is 9\n");
                        printf(" subtraction is also possible\n");
                    }
                    else
                    {
                        additional_entry = 7;
                        printf(" though subtraction is the next instruction no latency is available \n");
                        printf(" additional entry is 7\n");
                    }
                    break;
                default:
                    additional_entry = 7;
                    printf(" only multiplication is possible \n");
                    printf(" additional entry is 7\n");
                    break;
                }
            }
            else
            {
                printf(" no latency is available for multiplication\n");
                (temp0+0)->bits[6] += 1;
                (temp0+0)->bits[10] += 1;
                printf(" the next latency to look for is %d \n", (temp0+0)->bits[6]);
                additional_entry = 0;
                if ((temp0+0)->bits[6] == 8)
                {
                    printf(" no latency is available and reinitialising to matrix 2 \n");
                    (temp0+0)->bits[6] = 0;
                    (temp0+0)->bits[3] = 2;
                }
            }
            break;
        case 4:
            /* in here it will be determined whether div */
            /* can be added to pipe with add or sub */
            if ((temp1+matrix_index)->smatrix.bits_row1[(temp0+0)->bits[7]] == 0)
            {
                printf(" div is possible and checking to see if additional functions are possible and main case is 4 \n");
                switch ((temp2+(look_ahead+1))->func)
                {
                case 1:
                    if ((temp1+matrix_index)->smatrix.bits_row3[(temp0+0)->bits[7]] == 0)
                    {
                        additional_entry = 11;
                        printf(" additional entry is 11\n");
                        printf(" addition is also possible\n");
                    }
                    else
                    {
                        additional_entry = 10;
                        printf(" though addition is the next instruction no latency is available \n");
                        printf(" additional entry is 10\n");
                    }
                    break;
                case 2:
                    if ((temp1+matrix_index)->smatrix.bits_row3[(temp0+0)->bits[7]] == 0)
                    {
                        additional_entry = 12;
                        printf(" additional entry is 12\n");
                        printf(" subtraction is also possible\n");
                    }
                    else
                    {
                        additional_entry = 10;
                        printf(" though subtraction is the next instruction no latency is available \n");
                        printf(" additional entry is 10\n");
                    }
                    break;
                default:
                    additional_entry = 10;
                    printf(" only division is possible \n");
                    printf(" additional entry is 10\n");
                    break;
                }
            }
            else
            {
                printf(" no latency is available for division\n");
                (temp0+0)->bits[7] += 1;
                (temp0+0)->bits[10] += 1;
                printf(" the next latency to look for is %d \n", (temp0+0)->bits[7]);
                additional_entry = 0;
                if ((temp0+0)->bits[7] == 8)
                {
                    printf(" no latency is available and reinitialising to matrix 1 \n");
                    (temp0+0)->bits[7] = 0;
                    (temp0+0)->bits[3] = 1;
                }
            }
            break;
        case 5:
            printf(" the case is 5 and delta is being loaded wherein the priority index is 0\n");
            for (j = 0; j <= 7; j++)
            {
                (temp0+4)->bits[j] = (temp2+stack_index)->num_two[j];
            }
            if (re_adjust == 1)
            {
                printf("\n");
                printf(" the re-adjust is recognised as 1 and re-adjust is assigned 0 and the stack index is doubly incremented\n");
                printf("\n");
                re_adjust = 0;
            }
            else
            {
                (temp0+0)->bits[2] += 1;
                printf("\n");
                printf(" the re-adjust is recognised as 0 and the instruction stack flag is singly incremented\n");
                printf("\n");
            }
            delta_flag = 1;
            additional_entry = 0;
            /* this part will initiate the tracking registers */
            for (i = 1; i <= 9; i++)
            {
                if ((temp6+5)->logg[i] == 0)
                {
                    (temp6+5)->logg[i] = 1;
                    (temp12+i)->st_track[1] = 1;
                    (temp12+i)->address = (temp2+stack_index)->location;
                    i = 10;
                }
            }
            /* printing of the results of case 5 */
            printf("printing the pipeline register and flag register\n");
            printf("\n");
            printf("\n");
            printf(" the input registers temp0 + 3 ( 7 - 0 )\n");
            printf("\n");
            printf("\n");
            for (j = 0; j <= 7; j++)
            {
                printf(" %d ", (temp0+3)->bits[7-j]);
            }
            printf("\n");
            printf("\n");
            printf(" the input registers temp0 + 4 ( 8 - 0 )\n");
            printf("\n");
            printf("\n");
            for (j = 0; j <= 8; j++)
            {
                printf(" %d ", (temp0+4)->bits[8-j]);
            }
            printf("\n");
            printf("\n");
            printf(" the flag register 8 - 0\n");
            printf("\n");
            printf("\n");
            for (i = 0; i <= 8; i++)
            {
                printf(" %d ", (temp0+0)->bits[8-i]);
            }
            printf("\n");
            printf("\n");
            printf(" the logging registers 9 - 0\n");
            printf("\n");
            printf("\n");
            for (i = 1; i <= 9; i++)
            {
                printf(" %d ", (temp6+5)->logg[9-i]);
            }
            printf("\n");
            printf("\n");
            printf(" the tracking registers \n");
            printf("\n");
            printf("\n");
            for (i = 1; i <= 9; i++)
            {
                if ((temp6+5)->logg[i] == 1)
                {
                    printf(" the tracking register number is %d and the value of the address is %d \n", i, (temp12+i)->address);
                    printf("\n");
                    printf("\n");
                    for (j = 0; j <= 8; j++)
                    {
                        printf(" %d ", (temp12+i)->st_track[8-j]);
                    }
                }
            }
            printf("\n");
            printf("\n");
            break;
        }
    }
    switch (additional_entry)
    {
    case 1:
        /* the addition is being loaded */
        printf(" the additional entry is 1 and addition only \n");
        /* loading the arguments into the stage add */
        for (j = 0; j <= 7; j++)
        {
            (temp0+1)->bits[j] = (temp2+stack_index)->num_one[j+1];
            (temp0+2)->bits[j] = (temp2+stack_index)->num_two[j+1];
        }
        (temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.add_latency[(temp0+0)->bits[4]];
        (temp0+0)->bits[4] = 0;
        (temp0+0)->bits[10] = 0;
        (temp0+0)->bits[2] += 1;
        addition = 1;
        /* this part will initiate the tracking registers */
        for (i = 1; i <= 9; i++)
        {
            if ((temp6+1)->logg[i] == 0)
            {
                (temp6+1)->logg[i] = 1;
                (temp5+i)->st_track[1] = 1;
                (temp5+i)->address = (temp2+stack_index)->location;
                i = 10;
            }
        }
        printf(" the addition status is printed below in case 1\n");
        status_print1(temp0,temp5,temp6);
        break;
    case 2:
        /* the addition is being loaded */
        printf(" the latency is available \n");
        /* loading the arguments into the stage add */
        for (j = 0; j <= 7; j++)
        {
            (temp0+1)->bits[j] = (temp2+stack_index)->num_one[j+1];
            (temp0+2)->bits[j] = (temp2+stack_index)->num_two[j+1];
        }
        (temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.mult_add[(temp0+0)->bits[4]];
        (temp0+0)->bits[2] += 1;
        (temp0+0)->bits[10] = 0;
        addition = 1;
        /* this part will initiate the tracking registers */
        for (i = 1; i <= 9; i++)
        {
            if ((temp6+1)->logg[i] == 0)
            {
                (temp6+1)->logg[i] = 1;
                (temp5+i)->st_track[1] = 1;
                (temp5+i)->address = (temp2+stack_index)->location;
                i = 10;
            }
        }
        printf(" the addition status is printed below in case 2\n");
        status_print1(temp0,temp5,temp6);
        /* the multiplication is being loaded */
        printf(" the latency is available \n");
        /* loading the arguments into the stage mult */
        for (j = 0; j <= 7; j++)
        {
            (temp0+3)->bits[j] = (temp2+look_ahead)->num_one[j+1];
            (temp0+4)->bits[j] = (temp2+look_ahead)->num_two[j+1];
        }
        (temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.mult_add[(temp0+0)->bits[4]];
        (temp0+0)->bits[4] = 0;
        (temp0+0)->bits[2] += 1;
        /* this part will initiate the tracking registers */
        for (i = 1; i <= 9; i++)
        {
            if ((temp6+3)->logg[i] == 0)
            {
                (temp6+3)->logg[i] = 1;
                (temp4+i)->st_track[1] = 1;
                (temp4+i)->address = (temp2+look_ahead)->location;
                i = 10;
            }
        }
        printf(" the multiplication status is printed below in case 2\n");
        status_print3(temp0,temp4,temp6);
        multiplication = 1;
        break;
	case 3:
		/* the addition is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage add */
		for(j=0;j<=7;j++)
		{
			(temp0+1)->bits[j] = (temp2+stack_index)->num_one[j+1];
			(temp0+2)->bits[j] = (temp2+stack_index)->num_two[j+1];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_add[(temp0+0)->bits[4]];
		(temp0+0)->bits[2] += 1;
		(temp0+0)->bits[10] = 0;
		addition = 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+1)->logg[i] == 0)
			{
				(temp6+1)->logg[i] = 1;
				(temp5+i)->st_track[1] = 1;
				(temp5+i)->address = (temp2+stack_index)->location;
				i = 10;
			}
		}
		printf(" the addition status is printed below in case 3\n");
		status_print1(temp0,temp5,temp6);
		/* the division is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage div */
		for(j=0;j<=7;j++)
		{
			(temp0+3)->bits[j] = (temp2+look_ahead)->num_one[j+1];
		}
		for(j=0;j<=8;j++)
		{
			(temp0+4)->bits[j] = (temp2+look_ahead)->num_two[j];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_add[(temp0+0)->bits[4]];
		(temp0+0)->bits[4] = 0;
		(temp0+0)->bits[2] += 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+4)->logg[i] == 0)
			{
				(temp6+4)->logg[i] = 1;
				(temp3+i)->st_track[1] = 1;
				(temp3+i)->address = (temp2+look_ahead)->location;
				i = 10;
			}
		}
		printf(" the division status is printed below in case 3\n");
		status_print4(temp0,temp3,temp6);
		division = 1;
		break;
	case 4:
		/* the subtraction is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage add */
		for(j=0;j<=7;j++)
		{
			(temp0+1)->bits[j] = (temp2+stack_index)->num_one[j+1];
			(temp0+2)->bits[j] = (temp2+stack_index)->num_two[j+1];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.add_latency[(temp0+0)->bits[5]];
		(temp0+0)->bits[5] = 0;
		(temp0+0)->bits[10] = 0;
		(temp0+0)->bits[2] += 1;
		subtraction = 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+2)->logg[i] == 0)
			{
				(temp6+2)->logg[i] = 1;
				(temp11+i)->st_track[1] = 1;
				(temp11+i)->address = (temp2+stack_index)->location;
				i = 10;
			}
		}
		printf(" the subtraction status is printed below in case 4\n");
		status_print2(temp0,temp11,temp6);
		break;
	case 5:
		/* the subtraction is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage sub */
		for(j=0;j<=7;j++)
		{
			(temp0+1)->bits[j] = (temp2+stack_index)->num_one[j+1];
			(temp0+2)->bits[j] = (temp2+stack_index)->num_two[j+1];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.mult_add[(temp0+0)->bits[5]];
		(temp0+0)->bits[2] += 1;
		(temp0+0)->bits[10] = 0;
		subtraction = 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+2)->logg[i] == 0)
			{
				(temp6+2)->logg[i] = 1;
				(temp11+i)->st_track[1] = 1;
				(temp11+i)->address = (temp2+stack_index)->location;
				i = 10;
			}
		}
		printf(" the subtraction status is printed below in case 5\n");
		status_print2(temp0,temp11,temp6);
		/* the multiplication is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage mult */
		for(j=0;j<=7;j++)
		{
			(temp0+3)->bits[j] = (temp2+look_ahead)->num_one[j+1];
			(temp0+4)->bits[j] = (temp2+look_ahead)->num_two[j+1];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.mult_add[(temp0+0)->bits[5]];
		(temp0+0)->bits[5] = 0;
		(temp0+0)->bits[2] += 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+3)->logg[i] == 0)
			{
				(temp6+3)->logg[i] = 1;
				(temp4+i)->st_track[1] = 1;
				(temp4+i)->address = (temp2+look_ahead)->location;
				i = 10;
			}
		}
		printf(" the multiplication status is printed below in case 5\n");
		status_print3(temp0,temp4,temp6);
		multiplication = 1;
		break;
	case 6:
		/* the subtraction is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage sub */
		for(j=0;j<=7;j++)
		{
			(temp0+1)->bits[j] = (temp2+stack_index)->num_one[j+1];
			(temp0+2)->bits[j] = (temp2+stack_index)->num_two[j+1];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_add[(temp0+0)->bits[5]];
		(temp0+0)->bits[2] += 1;
		(temp0+0)->bits[10] = 0;
		subtraction = 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+2)->logg[i] == 0)
			{
				(temp6+2)->logg[i] = 1;
				(temp11+i)->st_track[1] = 1;
				(temp11+i)->address = (temp2+stack_index)->location;
				i = 10;
			}
		}
		printf(" the subtraction status is printed below in case 6\n");
		status_print2(temp0,temp11,temp6);
		/* the division is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage div */
		for(j=0;j<=7;j++)
		{
			(temp0+3)->bits[j] = (temp2+look_ahead)->num_one[j+1];
		}
		for(j=0;j<=8;j++)
		{
			(temp0+4)->bits[j] = (temp2+look_ahead)->num_two[j];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_add[(temp0+0)->bits[5]];
		(temp0+0)->bits[5] = 0;
		(temp0+0)->bits[2] += 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+4)->logg[i] == 0)
			{
				(temp6+4)->logg[i] = 1;
				(temp3+i)->st_track[1] = 1;
				(temp3+i)->address = (temp2+look_ahead)->location;
				i = 10;
			}
		}
		printf(" the division status is printed below in case 6\n");
		status_print4(temp0,temp3,temp6);
		division = 1;
		break;
	case 7:
		/* the multiplication is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage mult */
		for(j=0;j<=7;j++)
		{
			(temp0+3)->bits[j] = (temp2+stack_index)->num_one[j+1];
			(temp0+4)->bits[j] = (temp2+stack_index)->num_two[j+1];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.mult_latency[(temp0+0)->bits[6]];
		(temp0+0)->bits[6] = 0;
		(temp0+0)->bits[10] = 0;
		(temp0+0)->bits[2] += 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+3)->logg[i] == 0)
			{
				(temp6+3)->logg[i] = 1;
				(temp4+i)->st_track[1] = 1;
				(temp4+i)->address = (temp2+stack_index)->location;
				i = 10;
			}
		}
		printf(" the multiplication status is printed below in case 7\n");
		status_print3(temp0,temp4,temp6);
		multiplication = 1;
		break;
	case 8:
		/* the addition is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage add */
		for(j=0;j<=7;j++)
		{
			(temp0+1)->bits[j] = (temp2+look_ahead)->num_one[j+1];
			(temp0+2)->bits[j] = (temp2+look_ahead)->num_two[j+1];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.mult_add[(temp0+0)->bits[6]];
		(temp0+0)->bits[2] += 1;
		(temp0+0)->bits[10] = 0;
		addition = 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+1)->logg[i] == 0)
			{
				(temp6+1)->logg[i] = 1;
				(temp5+i)->st_track[1] = 1;
				(temp5+i)->address = (temp2+look_ahead)->location;
				i = 10;
			}
		}
		printf(" the addition status is printed below in case 8\n");
		status_print1(temp0,temp5,temp6);
		/* the multiplication is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage mult */
		for(j=0;j<=7;j++)
		{
			(temp0+3)->bits[j] = (temp2+stack_index)->num_one[j+1];
			(temp0+4)->bits[j] = (temp2+stack_index)->num_two[j+1];
		}
		printf("\n");
		printf("\n");
		printf(" looking at temp0 + 3 in case 8 \n");
		printf("\n");
		printf("\n");
		for(j=0;j<=7;j++)
		{
			printf(" %d ",(temp0+3)->bits[j]);
		}
		printf("\n");
		printf("\n");
		printf(" looking at temp0 + 4 in case 8 \n");
		printf("\n");
		printf("\n");
		for(j=0;j<=7;j++)
		{
			printf(" %d ",(temp0+4)->bits[j]);
		}
		printf("\n");
		printf("\n");
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.mult_add[(temp0+0)->bits[6]];
		(temp0+0)->bits[6] = 0;
		(temp0+0)->bits[2] += 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+3)->logg[i] == 0)
			{
				(temp6+3)->logg[i] = 1;
				(temp4+i)->st_track[1] = 1;
				(temp4+i)->address = (temp2+stack_index)->location;
				i = 10;
			}
		}
		printf(" the multiplication status is printed below in case 8\n");
		status_print3(temp0,temp4,temp6);
		multiplication = 1;
		break;
	case 9:
		/* the subtraction is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage sub */
		for(j=0;j<=7;j++)
		{
			(temp0+1)->bits[j] = (temp2+look_ahead)->num_one[j+1];
			(temp0+2)->bits[j] = (temp2+look_ahead)->num_two[j+1];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.mult_add[(temp0+0)->bits[6]];
		(temp0+0)->bits[2] += 1;
		(temp0+0)->bits[10] = 0;
		subtraction = 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+2)->logg[i] == 0)
			{
				(temp6+2)->logg[i] = 1;
				(temp11+i)->st_track[1] = 1;
				(temp11+i)->address = (temp2+look_ahead)->location;
				i = 10;
			}
		}
		printf(" the subtraction status is printed below in case 9\n");
		status_print2(temp0,temp11,temp6);
		/* the multiplication is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage mult */
		for(j=0;j<=7;j++)
		{
			(temp0+3)->bits[j] = (temp2+stack_index)->num_one[j+1];
			(temp0+4)->bits[j] = (temp2+stack_index)->num_two[j+1];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.mult_add[(temp0+0)->bits[6]];
		(temp0+0)->bits[6] = 0;
		(temp0+0)->bits[2] += 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+3)->logg[i] == 0)
			{
				(temp6+3)->logg[i] = 1;
				(temp4+i)->st_track[1] = 1;
				(temp4+i)->address = (temp2+stack_index)->location;
				i = 10;
			}
		}
		printf(" the multiplication status is printed below in case 9\n");
		status_print3(temp0,temp4,temp6);
		multiplication = 1;
		break;
	case 10:
		/* the division is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage div */
		for(j=0;j<=7;j++)
		{
			(temp0+3)->bits[j] = (temp2+stack_index)->num_one[j+1];
		}
		for(j=0;j<=8;j++)
		{
			(temp0+4)->bits[j] = (temp2+stack_index)->num_two[j];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_latency[(temp0+0)->bits[7]];
		(temp0+0)->bits[7] = 0;
		(temp0+0)->bits[10] = 0;
		(temp0+0)->bits[2] += 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+4)->logg[i] == 0)
			{
				(temp6+4)->logg[i] = 1;
				(temp3+i)->st_track[1] = 1;
				(temp3+i)->address = (temp2+stack_index)->location;
				i = 10;
			}
		}
		printf(" the division status is printed below in case 10\n");
		status_print4(temp0,temp3,temp6);
		division = 1;
		break;
	case 11:
		/* the addition is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage add */
		for(j=0;j<=7;j++)
		{
			(temp0+1)->bits[j] = (temp2+(look_ahead+1))->num_one[j+1];
			(temp0+2)->bits[j] = (temp2+(look_ahead+1))->num_two[j+1];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_add[(temp0+0)->bits[7]];
		re_adjust = 1;
		if(re_adjust == 1)
		{
			printf("\n");
			printf(" re_adjust has been assigned one in case 11\n");
			printf("\n");
		}
		addition = 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+1)->logg[i] == 0)
			{
				(temp6+1)->logg[i] = 1;
				(temp5+i)->st_track[1] = 1;
				(temp5+i)->address = (temp2+(look_ahead+1))->location;
				i = 10;
			}
		}
		printf(" the addition status is printed below in case 11\n");
		status_print1(temp0,temp5,temp6);
		/* the division is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage div */
		for(j=0;j<=7;j++)
		{
			(temp0+3)->bits[j] = (temp2+stack_index)->num_one[j+1];
		}
		for(j=0;j<=8;j++)
		{
			(temp0+4)->bits[j] = (temp2+stack_index)->num_two[j];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_add[(temp0+0)->bits[7]];
		(temp0+0)->bits[7] = 0;
		(temp0+0)->bits[10] = 0;
		(temp0+0)->bits[2] += 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+4)->logg[i] == 0)
			{
				(temp6+4)->logg[i] = 1;
				(temp3+i)->st_track[1] = 1;
				(temp3+i)->address = (temp2+stack_index)->location;
				i = 10;
			}
		}
		printf(" the division status is printed below in case 11\n");
		status_print4(temp0,temp3,temp6);
		division = 1;
		break;
	case 12:
		/* the subtraction is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage sub */
		for(j=0;j<=7;j++)
		{
			(temp0+1)->bits[j] = (temp2+(look_ahead+1))->num_one[j+1];
			(temp0+2)->bits[j] = (temp2+(look_ahead+1))->num_two[j+1];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_add[(temp0+0)->bits[7]];
		re_adjust = 1;
		if(re_adjust == 1)
		{
			printf("\n");
			printf(" re_adjust has been assigned one in case 12\n");
			printf("\n");
		}
		subtraction = 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+2)->logg[i] == 0)
			{
				(temp6+2)->logg[i] = 1;
				(temp11+i)->st_track[1] = 1;
				(temp11+i)->address = (temp2+(look_ahead+1))->location;
				i = 10;
			}
		}
		printf(" the subtraction status is printed below in case 12\n");
		status_print2(temp0,temp11,temp6);
		/* the division is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage div */
		for(j=0;j<=7;j++)
		{
			(temp0+3)->bits[j] = (temp2+stack_index)->num_one[j+1];
		}
		for(j=0;j<=8;j++)
		{
			(temp0+4)->bits[j] = (temp2+stack_index)->num_two[j];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_add[(temp0+0)->bits[7]];
		(temp0+0)->bits[7] = 0;
		(temp0+0)->bits[10] = 0;
		(temp0+0)->bits[2] += 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+4)->logg[i] == 0)
			{
				(temp6+4)->logg[i] = 1;
				(temp3+i)->st_track[1] = 1;
				(temp3+i)->address = (temp2+stack_index)->location;
				i = 10;
			}
		}
		printf(" the division status is printed below in case 12\n");
		status_print4(temp0,temp3,temp6);
		division = 1;
		break;
	case 0:
		if(delta_flag == 0)
		{
			printf(" the additional entry = 0 and delta flag = 0\n");
			/* init the input reg to the pipe to 0 */
			init_key = 1;
			for(j=0;j<=8;j++)
			{
				(temp0+1)->bits[j] = 0;
				(temp0+2)->bits[j] = 0;
				(temp0+3)->bits[j] = 0;
				(temp0+4)->bits[j] = 0;
			}
		}
		break;
	}
	}
	return (*num0);
}

/********************************/
/****** Function Set Logg  ******/
/********************************/
struct6 set_logg(num1)
struct6 *num1;
{
	int i,j,k,l;
	struct6 *temp1;
	temp1 = num1;
	/* this function checks to see */
	/* whether any of the process loggs */
	/* are empty */
	for(i=1;i<=5;i++)
	{
		k = 0;
		for(j=0;j<=9;j++)
		{
			k = k | (temp1+i)->logg[j];
		}
		if(k == 0)
		{
			(temp1+i)->logg_stat = 0;
			printf(" logg stat is made 0 \n");
		}
		else
		{
			(temp1+i)->logg_stat = 1;
			printf(" logg stat is made 1 \n");
		}
	}
	return (*num1);
}

main()
{
	int one,two,i,j,k,l,m,v;
	FILE *inptr;
	FILE *read_ptr;
	int index,test,x,enough;
	index = 0;
	test = 0;
	x = 0;
	enough = 0;
	memory_ptr = &memory;
	dstack1_ptr = &decode_stack1;
	dstack2_ptr = &decode_stack2;
	ilatch_ptr = &iunit_latches;
	inhold_ptr = &internal_holders;
	statusu_ptr = &status_unit;
	gp_ptr = &gp_register;
	sr_ptr = &register_sr;
	pgm_ptr1 = &pgm_counter1;
	pgm_ptr2 = &pgm_counter2;
	isunit_ptr = &isunit_latch;
	sstatus_ptr = &stream_status;
	picstatus_ptr = &picqueue_status;
	eacstatus_ptr = &eacqueue_status;
	(sstatus_ptr)->picqueue_full = 0;
	(sstatus_ptr)->eacqueue_full = 0;
	/* enter the instruction */
	par_pointer = &par_product;
	arg1_pointer = &argument1;
	arg2_pointer = &argument2;
	lat_pointer = &latches;
	trans_pointer = &transfer;
	delay_ptr = &delay;
	mpreg_ptr = &multipurpose_reg;
	divflow_ptr = &div_follow;
	deltaflow_ptr = &delta_track;
	multflow_ptr = &mult_follow;
	addflow_ptr = &add_follow;
	subflow_ptr = &sub_follow;
	prlogg_ptr = &process_logg;
	prstack_ptr = &priority_stack;
	copy_one = arg1_pointer;
	copy_two = arg2_pointer;
	copy_three = par_pointer;
	copy_four = lat_pointer;
	copy_five = trans_pointer;
	copy_six = delay_ptr;
	/* initialising the pointers to the variables */
	for(i=1;i<=20;i++)
	{
		ptr_argmnt1[i] = &arg_one[i][0];
		ptr_argmnt2[i] = &arg_two[i][0];
	}
	ptr_op = &op_code;
	instack_ptr = &input_stack;
	outstack_ptr = &output_stack;
	bin_pointer = &binary_matrix;
	/* initialising all the flags to zero */
	printf("initialising the flags \n");
	for(i=0;i<=8;i++)
	{
		(mpreg_ptr+0)->bits[i] = 0;
	}
	(mpreg_ptr+0)->bits[3] = 3;
	(mpreg_ptr+0)->bits[2] = 1;
	(mpreg_ptr+0)->bits[0] = 2;
	re_adjust = 0;
	/* reading in of the instruction stack and control structures */
	/* reading of control.dat */
	printf(" enter the number of instructions in the stack \n");
	scanf("%d",&stk_ptr);
	inptr = fopen("control.dat", "r");
	if(inptr == (FILE *)NULL)
	{
		printf(" error in reading operation \n");
		exit(1);
	}
	fread(bin_pointer, sizeof(struct collision_matrix), 89, inptr);
	fclose(inptr);
	/* reading of the instr.dat */
	/* reading of the instruction stack */
	read_ptr = fopen("instr.dat", "r");
	if(read_ptr == (FILE *)NULL)
	{
		printf(" error in reading operation for instr.dat\n");
		exit(1);
	}
	for(i=1;i<=stk_ptr;i++)
	{
		fscanf(read_ptr,"\n");
		fscanf(read_ptr," %d\t ",&op_code[i]);
		for(j=1;j<=8;j++)
		{
			fscanf(read_ptr," %d ",&arg_one[i][j]);
		}
		fscanf(read_ptr,"\t");
		for(j=1;j<=8;j++)
		{
			fscanf(read_ptr," %d ",&arg_two[i][j]);
		}
		fscanf(read_ptr,"\n");
		fscanf(read_ptr,"\n");
	}
	fclose(read_ptr);
	/* printing of the instruction stack */
	printf(" the instruction stack is printed below \n");
	for(i=1;i<=stk_ptr;i++)
	{
		printf("\n");
		printf(" %d\t ",op_code[i]);
		for(j=1;j<=8;j++)
		{
			printf(" %d ",arg_one[i][j]);
		}
		printf("\t");
		for(j=1;j<=8;j++)
		{
			printf(" %d ",arg_two[i][j]);
		}
		printf("\n");
		printf("\n");
	}
	printf(" the various structures are tabulated below \n");

	printf("\n");
	for(v=1; v<=88; v++)
	{
		for(l=0;l<8;++l)
		{
			printf("%d \b ",binary_matrix[v].smatrix.bits_row1[l]);
		}
		printf("\n");
		printf("\n");
		for(l=0;l<8;++l)
		{
			printf("%d \b ",binary_matrix[v].smatrix.bits_row2[l]);
		}
		printf("\n");
		printf("\n");
		for(l=0;l<8;++l)
		{
			printf("%d \b ",binary_matrix[v].smatrix.bits_row3[l]);
		}
		printf("\n");
		printf("\n");
		printf("\n");
		for(l=0;l<8;++l)
		{
			printf("%d \b ",binary_matrix[v].sdirection.div_latency[l]);
		}
		printf("\n");
		printf("\n");
	}
	for(i=1;i<=stk_ptr;i++)
	{
		printf("the value of argument one is as follows \n");
		for(j=0;j<=8;j++)
		{
			printf(" %d ",(instack_ptr + i)->num_one[j]);
		}
		printf("\n");
		printf("the value of argument two is as follows \n");
		for(j=0;j<=8;j++)
		{
			printf(" %d ",(instack_ptr + i)->num_two[j]);
		}
		printf("\n");
	}
	i = 0;
	while((mpreg_ptr+0)->bits[2] <= 13)
	{
		/* initialising the pointers */
		(pgm_ptr1)->counter[0] = 1;
		(pgm_ptr2)->counter[0] = 0;
		/* the T ON cycle */
		fetch_unit(memory_ptr,inhold_ptr,pgm_ptr1,pgm_ptr2,
			picstatus_ptr,eacstatus_ptr);
		load_pipeline(mpreg_ptr,bin_pointer,instack_ptr,
			divflow_ptr,multflow_ptr,addflow_ptr,prlogg_ptr,
			var1,var2,var3,prstack_ptr,subflow_ptr,deltaflow_ptr);
		pipeline();
		set_logg(prlogg_ptr);
		/* the T OFF cycle */
		time_off();
		output_check(trans_pointer,prstack_ptr,
			outstack_ptr,divflow_ptr,multflow_ptr,addflow_ptr,
			prlogg_ptr,subflow_ptr,deltaflow_ptr,mpreg_ptr);
		shift_track(divflow_ptr,multflow_ptr,addflow_ptr,
			prlogg_ptr,subflow_ptr,deltaflow_ptr);
	}
}
