
LOOK-AHEAD INSTRUCTION SCHEDULING FOR DYNAMIC
EXECUTION IN PIPELINED COMPUTERS

A Thesis Presented to
The Faculty of the College of Engineering and Technology
Ohio University

In Partial Fulfillment
of the Requirements for the Degree
Master of Science

by
Vijay K. Reddy Anam

June, 1990

TABLE OF CONTENTS

CHAPTER I    Introduction
CHAPTER II   Design of Look-Ahead Pipelined Computer System
    2.1  Introduction
    2.2  Dynamic Instruction Scheduling
    2.3  Reducing Branch Penalty
    2.4  Hardware System
CHAPTER III  Design of Dynamic Pipelined Arithmetic Unit
    3.1  Introduction
    3.2  Principle of Operation of the CSA Tree
    3.3  Conversion of Unifunction Pipeline to Multifunction Pipeline
    3.4  Dynamic Execution of Instructions
CHAPTER IV   Instruction Execution in the Pipeline System
CHAPTER V    Computer Simulation and Experimental Results
    5.1  Functions Emulating the Stages of the PIU
    5.2  Functions Emulating the Stages of the PEU
    5.3  Control of the Pipeline
    5.4  Computer Generation of the State Diagrams
    5.5  Experimental Results
CHAPTER VI   Conclusions and Discussions
REFERENCES
APPENDIX
    A.  State Matrices
    B.  Computer Program to Generate State Matrices
    C.  Simulation Program

CHAPTER ONE
INTRODUCTION

Advances in computer technology are leading to the advent of high speed computers which are cost effective and faster than their predecessors. Mainframe machines like the Texas Instruments TI-ASC, IBM System/360 Models 91 and 195, Burroughs PEPE, CRAY-1, CDC STAR-100, CDC 6600 and CDC 7600 have, to a large extent, pipeline processing capabilities in their instruction and arithmetic units or in the form of pipelined special purpose functional units [1-4].
Pipelining is a way of embedding parallelism in a system. The principle of pipelining is to partition a process into several subprocesses and execute these subprocesses concurrently in dedicated individual units. This is analogous to the operation of an assembly line in the automotive industry. In a non-pipelined computer system, the execution of an instruction involves the following processes: 1) fetching the instruction, 2) decoding the instruction, 3) fetching the operands, and 4) executing the instruction. In a pipelined system, instruction execution can be split into four subprocesses which are performed by dedicated units functioning concurrently. The advantage of this operation is that while a unit is operating on an instruction, the immediately preceding unit can be operating on the next instruction, and so on. Thus the throughput of a pipelined system is much higher than that of a non-pipelined system. The overlapped execution is depicted in a space time diagram in Fig. 1.1.
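The overlap described above can be visualized with a short sketch. The fragment below is illustrative only, not part of the thesis simulator; it assumes a four-stage pipeline (fetch, decode, operand fetch, execute) with one cycle per stage and prints a space-time diagram of the kind shown in Fig. 1.1:

```python
# Space-time diagram for an ideal 4-stage pipeline.  Instruction i
# enters the pipe at cycle i, so k instructions finish in k + 3 cycles
# instead of the 4*k cycles a non-pipelined machine would need.
STAGES = ["F", "D", "O", "E"]   # fetch, decode, operand fetch, execute

def space_time(n_instructions):
    n_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        # blanks before the instruction enters, one stage per cycle after
        row = ["."] * n_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name
        rows.append(f"I{i + 1}  " + " ".join(row))
    return rows

if __name__ == "__main__":
    for line in space_time(4):
        print(line)
```

For four instructions this prints a staircase of F D O E rows spanning seven cycles, versus the sixteen cycles the same work would take without overlap.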
The second generation and earlier computers employed arithmetic and logic units which were unsophisticated and under-utilized. The introduction of pipeline techniques in processor design necessitated the advent of new algorithms to control the instruction flow and resolve any hazards that might arise in the execution of instructions. Several look-ahead algorithms have been proposed with the capability of executing more than one instruction at the same time. These algorithms were successfully employed in many third generation computers involving multiple execution units.

The look-ahead algorithms were designed at the processor level and involved the following common tasks: 1) detecting the instructions that can be executed concurrently, 2) issuing the instructions to the functional units, and 3) assigning the registers to various operands. The ideal throughput is difficult to achieve due to dependencies within the instructions of a program. The data dependencies have to be resolved either by scheduling the execution of the instruction or by placing the instruction into a buffer and monitoring the registers to resolve the instructional dependencies. Tomasulo [5] has

Fig. 1.1  Ideal operation of a pipelined computer system: a) the structure of a general pipeline computer; b) the ideal flow of instructions in time space

proposed an algorithm to resolve the dependency situation by creating a reservation station (RS) to hold instructions that are awaiting execution. Instructions remain in the RS until the operand conflicts are resolved. The RS monitors a common data bus and captures the operands for the instructions as they become available. The instruction identifies its operands by an address tag scheme. Each source register is assigned a ready bit which determines the usage of the register. A register is set busy if it is the destination of an instruction in execution. If a source register is busy when an instruction reaches the issue stage, the tag of the register is obtained and attached to the instruction, and the instruction is forwarded to the RS. If a sink register is busy, then a new tag is attached to the instruction against the sink register and the tag is updated on that register. This system is expensive to implement: each register has to be tagged, and each tag needs associative comparison hardware to carry out the tag matching process. The problem is compounded if the number of registers is large. Sohi and Vajapeyam [6] have modified and extended Tomasulo's algorithm for the CRAY-1 system. The modifications were made to reduce the hardware needed for tagging a large bank of registers. The tags are all consolidated into a tag unit (TU). The tags are issued to registers from the TU and are returned to the common pool as soon as a tag is released. The reservation stations are combined into a common RS pool, and instructions are issued to the various functional units as they become ready. This scheme relies on the tag comparing hardware for proper execution and still requires a large number of register tags for all its registers. In both algorithms [5] and [6], associative tags are compared while forwarding a single instruction. If the instructions awaiting execution are large in number, the process of associative comparison is time consuming and cannot be avoided. Keller [7] proved that optimal resolution of dependencies could be achieved by a control scheme that employs first-in first-out (FIFO) queues. Unlike the previous algorithms, these queues eliminate the associative search process. A queue is associated with each pair of conflicting operations. An operation belongs to a queue if it is one of the operations associated with that queue. The elements stored in the queue are represented by tokens. Each operation involves a distinct token. When an instruction enters the issue stage, it places a token at the tail of each queue corresponding to the operation. Before an operation begins, there must be a corresponding token at the head of each queue to which the operation belongs. When the operation is completed, the tokens are removed. Each queue is implemented as a linked list. The disadvantage of this scheme is that if there are m different binary functions and n different registers, the number of queues would be (m*n)^4. Dennis [8] proposed a similar queuing scheme with substantially fewer queues. These queues are not FIFO in nature; each queue corresponds to a single register. Token interchanging can occur in a nondeterministic fashion, which casts doubt on the efficiency of such an implementation. Tjaden and Flynn [9] have proposed a scheme wherein a block of M instructions can be executed simultaneously. The scheme analyzes the dependencies of a block of instructions and issues a set of independent instructions for execution. This scheme has two constraints: 1) it cannot handle indirect addressing, and 2) the source operands, the sink result, and the next instruction must be specified by defining their locations in storage.

Ramamoorthy and Kim [10] have proposed a scheme called the dynamic sequencing and segmentation model (DSSM) for efficient sequencing of instructions with very low overheads. The overheads are reduced by overlapping the unproductive administrative and bookkeeping computations with the execution of computational tasks. The end result is the efficient exploitation of parallelism. Smith and Weiss [11] have proposed a modified scheme of Thornton's algorithm [12] for the Cray-1 system. In this algorithm, dynamic scheduling is adopted and the associative tag comparisons are eliminated.

The effectiveness of the above mentioned schemes is dependent on the availability of functional units. This problem is alleviated by providing replicated functional units, as provided in the TI-ASC computer [13], and reconfiguring the units as needed. The general approach is to provide a static functional unit for each class of operations. Static functional units can execute instructions only when the operation defined by the instruction falls within the class for which the unit was designed. The Astronautics ZS-1 [14] operates on a decoupled architecture and supports two instruction streams. This machine is capable of forwarding two instructions to the execution units within a clock period. Dependent instructions are held at the issue stage until the dependency is resolved. The two streams are unequal in length and are supported by multiple static execution units. Data can be copied between the two units via a copy unit. Queues are used for memory operands, providing a flexible way of relating the memory access functions and floating point operations. This provides a dynamic allocation of memory access functions ahead of the floating point operations. There is no reordering of instructions within a pipeline.
In this research a system is developed which executes instructions dynamically. The hardware is a pipelined system consisting of two fundamental sub-systems: the pipelined instruction unit (PIU) and the pipelined execution unit (PEU). The PIU can further be divided into the fetch unit (FU), the decode unit (DU), and the issue unit (IU). The PEU is also divided into the dynamic arithmetic unit (DAU) and the logic unit (LU).

Fig. 1.2  Proposed pipeline system shown with the sub-units

The overall system configuration is illustrated in Fig. 1.2. The operation of the system assumes


no shuffling of instructions by any compiler. The hardware supports two instruction streams, which are necessary for executing branch instructions. The DAU can execute three different arithmetic operations independently within the same pipeline cycle. This improves the performance over a similar static unit capable of executing a single operation at a time. A simple tagging system is used to resolve the dependency within instructions. No associative comparisons are necessary in this algorithm. The instructions are held in delay stations (DS) present in the stages of the execution units. An instruction is held in a stage only if it needs a missing operand to enter the next stage. The data is fed to the DS via a common data bus (CDB).
The remainder of this thesis is organized in six chapters. Chapter II introduces the system and explains the function of each sub-system along with the scheduling of instructions. Chapter III describes the operation and the design of the DAU. It also includes the generation of state diagrams to predict the latencies and to schedule the execution of instructions in the DAU. Chapter IV explains the operation of the proposed system. Chapter V deals with the computer simulation of the system and the experimental results. Chapter VI includes discussion and conclusions.

CHAPTER TWO

DESIGN OF THE LOOK-AHEAD PIPELINED COMPUTER SYSTEM

2.1  INTRODUCTION:

As stated in Chapter 1, sequential computers are not efficient in utilizing their resources. The serial design principles do not allow any independence to the functional units present in the central processing unit (CPU). The instructions are executed serially, one at a time. There is no overlap between two successive instructions in the execution phase. This leads to many of the functional units being idle most of the time. The new generation complex instruction set computers (CISC), such as the Intel 80286, 80386 and 80486 and the Motorola 68020 and 68030, have incorporated pipelining techniques at the fetch level. The general pipeline system consists of stages devoted to fetch, decode, issue and execute. These stages operate concurrently. Elements are provided between the stages to synchronize the flow of data from one stage to another. This could also be achieved by incorporating these elements as a part of each stage. At the beginning of every pipeline cycle, each stage receives data from the previous stage. The data is processed and the result is forwarded to the next stage at the end of the cycle. During the cycle, the output of a stage will contain the result obtained from processing the data of the previous cycle. It will change to the current result only at the end of the current cycle. This is necessary to prevent the result of one stage preemptively influencing the operations of the next stage. The process is shown in a time space diagram in Fig. 2.1.
The pipelined system proposed in this research is an instruction look-ahead system which consists of four fundamental units. The system is illustrated in Fig. 2.2. The first three units comprise the pipelined instruction unit (PIU) and the last unit is the pipelined execution unit (PEU). The PIU consists of the following units: the fetch unit, the decode unit, and the issue unit. The execution unit is made up of the pipelined arithmetic unit (PAU) and the logic unit (LU). The arithmetic unit is subdivided into the dynamic fixed point arithmetic unit (DAU) and the dynamic floating point arithmetic unit (FPAU). The pipelined arithmetic units consist of seven stages and can perform the operations of addition, subtraction, multiplication and division. The dynamic nature of the arithmetic unit is exploited by the system to initiate more than one instruction in a single pipeline cycle. The individual operations take different amounts of time to execute. The table in Fig. 2.3 lists the execution times of the various arithmetic and logic operations. The design of the PAU is described in more detail in Chapter 3. The design of the LU and FPAU is left for further research.

Fig. 2.1  Time space diagram of instruction flow in a pipeline system (pipeline cycles 0 through 2, showing instructions 1-3 advancing through the fetch, decode and issue units via latches 1-3)

Fig. 2.2  The pipeline system with the various units (fetch unit, decode unit, issue unit, logic unit, fixed point arithmetic unit, floating point arithmetic unit)

    Instruction        Instruction Type
    Add / Subtract     Arithmetic
    Multiplication     Arithmetic
    Division           Arithmetic
    Store / Load       Logic
    And / Or / Not     Logic

Table 2.1 (Fig. 2.3)  Instructions and their execution times

The performance of a pipeline is dependent on the order of the instructions in the instruction stream. If consecutive instructions have data and control dependencies and contend for the same resources, then hazards will develop in the pipeline system and the performance will suffer. To improve performance, it is often possible to schedule the instructions so that the dependencies and resource conflicts are resolved. There are two different ways that instruction scheduling can be carried out. First, it can be done at compile time by the compiler or the linker. This is referred to as static scheduling because it does not change as the program is being executed. Second, it can be done by hardware at execution time. This is referred to as dynamic scheduling. Most compilers for pipelined processors do some sort of static scheduling. Static scheduling does not have any information about the run-time dependencies, and hence the optimization is highly relative to the type of program being compiled. Dynamic scheduling, on the other hand, is independent of the compiled instruction code and can take advantage of the dependency information at the time of issue. This dependency information is not available at compile time. In this research a dynamic instruction scheduling algorithm is proposed based on the execution time periods of instructions. The rest of this chapter is organized in three main sections: 1) dynamic instruction scheduling, 2) reducing branch overheads, and 3) the hardware system.
2.2  DYNAMIC INSTRUCTION SCHEDULING:

The main objective of the scheduling algorithm is to overcome the four main hazards: 1) read after write (RAW), 2) write after write (WAW), 3) write after read (WAR), and 4) operational hazard. Their significance is worth a more elaborate explanation. The registers and memory are known as resources. A RAW hazard occurs when an instruction tries to read a resource that has not completed its last write process. A WAW hazard occurs if an instruction attempts to write into a resource that has yet to complete its previous write operation. A WAR hazard occurs when an instruction tries to write into a resource which has not completed its previous read operation.
Consider the following instructions:

    load  r3, (A);
    .....
    load  r2, (B);
    .....
    add   r1, r2, r3;
    store (X), r1;
    load  r1, (C);
    .....

A potential RAW hazard can occur if the add instruction is executed before the load instructions can update either r3 or r2. The add instruction may receive a value that is outdated if executed. The hazard is illustrated in Fig. 2.4. A WAR hazard can occur if the third load instruction is overlapped with the add instruction. In this case the resource (X) will be loaded with the result of the third load instruction before the store instruction can access r1. In simpler terms, the third load instruction will reinitialize r1 soon after the add instruction has initialized it. These events would take place before the store instruction accesses r1. The hazard is illustrated in Fig. 2.5. A WAW hazard occurs when the third load instruction updates r1 before the add instruction does. This is shown in Fig. 2.6. The operational hazard takes place if more than one instruction attempts to use the facilities of a particular stage during the same pipeline cycle. The common form of this hazard is that two instructions are scheduled to start execution at the same time from the same stage. This hazard can be eliminated by using the state matrix of the functional pipeline unit to schedule the execution of instructions, during initiation, into the arithmetic unit.
2.2.1  RESOLVING THE HAZARDS:

The number of pipeline cycles that an instruction needs to complete execution is fixed by the design of the execution unit. This information is used as the basis for scheduling the instructions. The instruction scheduling is carried out by the issue unit and the execution unit. The issue unit schedules the instruction to eliminate the RAW, WAW and WAR hazards. The execution unit schedules the instruction to eliminate the operational hazard.

Consider the set of instructions listed below:

    load  r1, (X);       r1 <-- (X)
    load  r2, (Y);       r2 <-- (Y)
    mult  r3, r1, r2;    r3 <-- r1 * r2
    store (Z), r3;       (Z) <-- r3
    add   r3, r1, r2;    r3 <-- r1 + r2
    store (U), r3;       (U) <-- r3
    load  r4, (B);       r4 <-- (B)
    load  r5, (D);       r5 <-- (D)
    mult  r3, r4, r5;    r3 <-- r4 * r5
    store (V), r3;       (V) <-- r3

Fig. 2.7 illustrates the ideal flow for the above set of instructions. The domain D(I) of an instruction is defined as the set of resource objects that may affect the instruction I. The range R(I) of an instruction is defined as the set of resource objects that are modified by the instruction I. A RAW hazard between instructions I and J will be present if the intersection between R(I) and D(J) is not a null set. A WAW hazard will occur if the intersection between R(I) and R(J) is not a null set. A WAR hazard will occur if the intersection between D(I) and R(J) is not a null set. Tabulating the conditions below:

    R(I) ∩ D(J) ≠ ∅    for RAW    (2.1)
    R(I) ∩ R(J) ≠ ∅    for WAW    (2.2)
    D(I) ∩ R(J) ≠ ∅    for WAR    (2.3)
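Conditions 2.1 through 2.3 are plain set intersections, so they can be checked mechanically. The sketch below is illustrative only; the dictionary encoding of an instruction is an assumption for the example, not the thesis format:

```python
# Hazard classification between an earlier instruction I and a later
# instruction J, following conditions 2.1-2.3: a hazard exists when
# the relevant intersection of range/domain sets is non-empty.
def domain(inst):
    """D(I): resource objects that may affect the instruction (sources)."""
    return set(inst["sources"])

def rng(inst):
    """R(I): resource objects modified by the instruction (sinks)."""
    return set(inst["sinks"])

def hazards(i, j):
    found = []
    if rng(i) & domain(j):
        found.append("RAW")    # J reads what I writes    (2.1)
    if rng(i) & rng(j):
        found.append("WAW")    # J rewrites what I writes (2.2)
    if domain(i) & rng(j):
        found.append("WAR")    # J writes what I reads    (2.3)
    return found

load_r3 = {"op": "load", "sources": {"(A)"}, "sinks": {"r3"}}
add_r1 = {"op": "add", "sources": {"r2", "r3"}, "sinks": {"r1"}}
load_r1 = {"op": "load", "sources": {"(C)"}, "sinks": {"r1"}}

print(hazards(load_r3, add_r1))   # ['RAW'] -- add reads r3
print(hazards(add_r1, load_r1))   # ['WAW'] -- both write r1
```

The two printed cases reproduce the RAW and WAW situations of Figs. 2.4 and 2.6 for the load/add/store example above.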

The hazards that arise when the instruction flow is ideal are shown in Fig. 2.8. A hazard free flow is illustrated in Fig. 2.9. This flow is achieved by scheduling the execution of the instructions. A time window is provided for each instruction to start and complete its execution.

    load R3, (A);
    load R2, (B);
    add  R1, R2, R3;

The ADD instruction is issued for execution while the previous two instructions are in execution.

Fig. 2.4  Occurrence of RAW hazard

    add   R1, R2, R3;
    store (X), R1;
    load  R1, (C);

The STORE instruction is issued after the LOAD instruction has completed execution.

Fig. 2.5  Occurrence of WAR hazard

    add   R1, R2, R3;
    store (X), R1;
    load  R1, (C);

The LOAD instruction completes execution before the ADD instruction.

Fig. 2.6  Occurrence of WAW hazard

Fig. 2.8  The various hazards in the ideal instructional flow: RAW hazards between each mult or add instruction and the two load instructions preceding it, RAW hazards between each store instruction and the arithmetic instruction producing its operand, and a WAW hazard between the add and mult instructions that both write r3

Fig. 2.10  Allotting the time window for hazard free execution of instructions

Fig. 2.9 is modified to show the time window in Fig. 2.10. The execution time is fixed for an instruction. Considering Fig. 2.9, the RAW hazard between instruction 1 and instruction 3 will be resolved as soon as instruction 1 completes execution. The same argument can be applied between instruction 2 and instruction 3. Basically, the condition listed in equation 2.1 must be false. To schedule instruction 3 it is necessary to know the time at which instructions 1 and 2 would complete execution. This is illustrated in Fig. 2.11. Instruction 4 depends on instruction 3, which in turn depends on instructions 1 and 2. Fig. 2.12 illustrates the resolving of the RAW hazard from instruction 1 to instruction 4. It is possible to generalize that when an instruction I initializes a resource R and an instruction I+k (k>0) reinitializes the same resource, then all the instructions between I and I+k will be dependent on instruction I for the resource R. This dependency will last as long as the instruction I is in the process of execution. The concept is illustrated in Fig. 2.13. Thus the time when instruction I would complete execution is important to schedule the dependent instructions. In our example, the times when R1, R2 and R3 would be initialized determine the exact time window for execution of instructions 3 and 4. Instruction 3 is delayed for execution until R1 and R2 have been initialized. Similarly, instruction 4 is delayed until R3 is updated.

Fig. 2.11  The various events (issue cycle, start of execution, end of execution) in the pipelined execution of instructions

Fig. 2.12  Resolving of RAW hazard from instruction 1 to instruction 4

The instructions between the two horizontal lines are dependent on the first multiplication instruction. These instructions will have to be scheduled depending on the availability of the result of the multiplication instruction.

Fig. 2.13  The highlighted instructions are dependent on register R3

The times when the four instructions would complete execution are also shown in Fig. 2.13. If no scheduling is carried out, instruction 3 will be issued during the fifth pipeline cycle. At this time, instruction 1 would require two cycles and instruction 2 would require three cycles to complete execution. To resolve the resource conflicts, pointers are associated with each resource and are used to monitor the write processes of each resource. Let pointers C1 to CN represent pointers that are associated in a one-to-one correspondence with the registers R1 to RN. Each time an instruction is issued for execution, the pointer corresponding to the sink resource is loaded with the time that the result would be placed into the resource. If vp represents the value of the pointer, then vp is numerically equal to the difference between the time of issue and the time that the sink resource (associated with this pointer) is updated with the result. For example, if the instruction is issued for execution during the fourth cycle and the result of the instruction will be available in the sink register during the seventeenth cycle, then vp would be equal to thirteen as shown below:

    vp = 17 - 4 = 13 pipeline cycles.

During the cycles that follow, the contents of the pointer are decremented by a single step in each cycle. This is due to the fact that the instruction is one step closer to completion of execution with each passing pipeline cycle.
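The countdown behaviour of the pointers can be modelled directly. The following sketch is illustrative only; the class name, register-file size and latency values are assumptions for the example, not part of the thesis design:

```python
# Per-register countdown pointers: when an instruction issues, the
# pointer of its sink register is loaded with the number of cycles
# until the result is written; every cycle, all non-zero pointers are
# decremented.  An instruction may issue only when the pointers of
# its source registers have reached zero (operands ready).
LATENCY = {"load": 6, "add": 6, "mult": 8}   # assumed cycle counts

class PointerFile:
    def __init__(self, n_regs):
        self.c = {f"r{i}": 0 for i in range(1, n_regs + 1)}

    def tick(self):
        for r in self.c:
            if self.c[r] > 0:
                self.c[r] -= 1

    def ready(self, sources):
        return all(self.c[r] == 0 for r in sources)

    def issue(self, op, sink):
        self.c[sink] = LATENCY[op]

pf = PointerFile(5)
pf.issue("load", "r1")          # cycle 1: load r1,(X) -> c1 = 6
pf.tick()
pf.issue("load", "r2")          # cycle 2: load r2,(Y) -> c2 = 6, c1 = 5
print(pf.ready({"r1", "r2"}))   # False: mult r3,r1,r2 must wait
```

Only when both counters have counted down to zero may the dependent multiplication be issued, which is exactly the delaying of instruction 3 described above.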

Thus when vp reaches zero, the result will be available in the sink register associated with the pointer. Initializing the pointers at the time of issue and monitoring them gives the information as to when the write process to the resource (associated with the pointer) is completed. Hence the instructions that are dependent on this resource will have to be delayed until vp decrements to zero. Considering the instruction set, the value present in C1 will be equal to 2 and C2 will be equal to 3 at the time instruction 3 is issued. Fig. 2.14 illustrates the pointer values during the ideal instruction flow. Fig. 2.15 illustrates how Fig. 2.14 is modified to obtain Fig. 2.13. Using the pointers to denote the time when each resource would complete its most recent update, the algorithm is developed as follows.
Let the instruction stream be represented by a set of instructions IS = { I1, I2, I3, I4, ..., In }, where n is the maximum number of instructions in the window in memory at any pipeline cycle. Let the registers in the system be represented as a register set R = { r1, r2, r3, ..., rN }, where N represents the total number of registers in the system. Let C = { c1, c2, c3, ..., cN } represent a set of counters (pointers) that are assigned to the register set. There is a one-to-one correspondence between the counters and the registers. A counter is assigned to a single register and vice versa. For simplicity, we assume that a counter denoted

Fig. 2.14  Pointer values (C1, C2, C3) associated with the sink registers

by subscript j is assigned to the register with subscript j. Each counter carries the information about the number of pipeline cycles that are needed for the assigned register to assume its new value when initialized by the most recent instruction. The algorithm is based on the following:

1) The instruction order has to be maintained.

2) The value carried by the counter which is assigned to a specific register can change only when the register is used as the sink register by the instruction that is being scheduled.

3) The maximum number of source registers that can be specified is two, and the number of sink registers is one.

4) The registers and the counters are all initialized to zero at the start of operations. The first instruction is executed assuming zero possibility of either hazards or collision.

Let each instruction be represented by Ik = { OC, ra, rb, rc, ca, cb, cc }, where OC is the op-code of the instruction, ra, rb and rc are the registers used by the instruction, and ca, cb and cc are the counters that are designated to the registers ra, rb and rc respectively. To schedule an instruction, there are three possibilities that need to be investigated: 1) all counter elements associated with the source registers are zero, 2) only one counter element associated with a single source register is zero, and 3) none of the counters assigned to the source registers are zero. Equations for the instruction scheduling are developed as the analysis is carried out for each case. The equations are then summarized.
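The three-way classification at issue time can be written down compactly. This sketch is illustrative only (the function name and encoding are not from the thesis); it merely distinguishes the three cases by inspecting the counters ca and cb of the two source registers:

```python
# Classify an instruction Ik = { OC, ra, rb, rc, ca, cb, cc } at issue
# time by the counters ca, cb of its two source registers ra and rb.
def issue_case(ca, cb):
    if ca == 0 and cb == 0:
        return 1    # case 1: both operands ready, no RAW hazard
    if ca == 0 or cb == 0:
        return 2    # case 2: exactly one source still being written
    return 3        # case 3: both sources still being written

print(issue_case(0, 0), issue_case(0, 3), issue_case(2, 3))  # 1 2 3
```

Each case then gets its own delay equations, developed in the analysis that follows.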
CASE 1:

In this case no RAW hazard is involved. The data operands are currently available in their respective source registers and the instruction can be issued to the execution unit without assigning any delay. This instruction will place its result in the sink register after execution. The result will be assigned to the destination register after T pipeline cycles, which is given by

    T = Te + Ts                                    (2.4)

where Te is the time required for execution by the execution unit and is fixed by the system design, and Ts is the system delay that is fixed by the overheads in the system. Hence the result will be placed in the sink register T pipeline cycles after the present cycle. Consider the set of instructions given below:
    1. load  r1, (X);      r1 <-- (X)
    2. load  r2, (Y);      r2 <-- (Y)
       ..........
    8. mult  r3, r1, r2;   r3 <-- r1 * r2
    9. store (Z), r3;      (Z) <-- r3

Let the first instruction be the load instruction. This is issued to the execution unit and c1 is initialized to 6. It takes six pipeline cycles for r1 to read (X). Similarly, in the second pipeline cycle c2 is loaded with 6. The contents of c1 during the second pipeline cycle will be 5.


If the multiplication instruction is executed after the eighth cycle, the data is readily available and the instruction can be issued without assigning any delay. In the above example Te is equal to 4 pipeline cycles and Ts is equal to 2 pipeline cycles, resulting in T being 6 pipeline cycles. The multiplication instruction is issued at the ninth pipeline cycle. Ttest is used to check against the WAW hazard and is numerically equal to the sum of T and any other delay:

    Ttest = T + Tadditional-delay                  (2.5)

In this case, where no RAW hazard occurs, the additional delay term is 0.

The WAW hazard is a possibility if the c , ~ ~ is


~ (not
~ ~ ~ )
zero. The subscript

llsink(old)llrefers to the current

value of the counter associated with the sink register (the one
used by the present instruction) which has not yet been
updated by the issue unit. This implies that a previously
initiated instruction I using the same sink register has not
yet been updated. If the present instruction is denoted as
instruction J, then R(I) is not equal to R(J). The Tsink-delay
is the delay assigned to the instruction by the issue
unit to resolve the WAW dependencies. The calculation of the


delay depends on two cases: A) the value of the counter
csink(old) is greater than Ttest, and B) the value of the
counter csink(old) is less than or equal to Ttest.
CASE A:

csink(old) greater than Ttest implies that a WAW hazard
will occur. The instruction has to be delayed until
csink(old) is less than Ttest. The difference between
csink(old) and Ttest can be set by the system or be a fixed
value. In this research the value is fixed and is equal to
two pipeline cycles. The Tsink-delay is calculated as follows:

    Tsink-delay = csink(old) - (Te - 1)                    (2.6)

Let Tinst-delay represent the total time delay assigned to the
instruction to resolve the RAW and WAW hazards. The Tinst-delay
is numerically equal to the Tsink-delay in the absence of the
RAW hazard.

    Tinst-delay = Tsink-delay                              (2.7)

The new value of csink(new) can be set according to the
following equations:

    csink(new) = Te + Ts + (Tinst-delay - 1)               (2.8)
               = Te + Ts + csink(old) - (Te - 1) - 1       (2.9)
               = Ts + csink(old)                           (2.10)
               = csink(old) + 2                            (2.11)

In equation 2.8 the term "(Tinst-delay - 1)" is used
because of the overlap of the delay value becoming zero and
the beginning of execution for the instruction. If a delay
of 10 cycles is assigned to the instruction, then the
instruction will start execution when the delay decrements
to 0. Thus the time that the result will be loaded into the
register will be (9 + execution time) rather than (10 +
execution time). For example, to execute the multiplication


instruction, the contents of c3 must be evaluated. If c3 is
non-zero then there is a possibility of a WAW hazard. Let the
contents of c3 be 12 at the time the multiplication
instruction is being issued. This implies that the previous
instruction that has used r3 as its sink has not completed
execution and there will be an additional 12 cycles
before the previous instruction will update r3. The Te for
the mult instruction is 6 pipeline cycles. From the equation,
Tinst-delay is computed to be 7 pipeline cycles. It is
evident that if the instruction is not delayed, the present
multiplication instruction will initialize the register r3
with the wrong value. This is not acceptable as it gives
rise to a WAW hazard. The multiplication instruction should
be executed after 7 pipeline cycles. The result of the
present instruction will be loaded into r3 after 14 pipeline
cycles from the current cycle. Hence c3 is initialized to 14
before the instruction is issued. The new value of c3 is
used to determine WAW hazards with the instructions
logically following the multiplication instruction.
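The Case A arithmetic can be sketched in Python. This is a minimal sketch; the function name and the separation of Te and Ts into parameters are illustrative assumptions, but the numbers reproduce the worked example above (c3 = 12 at issue, Te = 6, Ts = 2, giving a 7-cycle delay and c3 reinitialized to 14):

```python
def case_a_delays(c_sink_old, t_e, t_s):
    """WAW (Case A) delay assignment: the instruction waits until
    the older write to the sink register is about to complete."""
    t = t_e + t_s                           # total time T = Te + Ts
    t_inst_delay = c_sink_old - (t_e - 1)   # delay assigned to the instruction
    # the -1 accounts for the overlap of the delay reaching zero
    # and the instruction beginning execution
    c_sink_new = t + (t_inst_delay - 1)
    return t_inst_delay, c_sink_new

# Worked example: c3 = 12 at issue, Te = 6, Ts = 2
print(case_a_delays(12, 6, 2))   # -> (7, 14)
```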


CASE B:

The counter element assigned to the sink register
is less than or equal to Ttest. The possibility of a WAW hazard
exists and this warrants that a delay be introduced by the
system. The delay assigned is at most two pipeline cycles. The
calculation of Tsink-delay differs from the previous case.
    Tsink-delay = 2   if Ttest - csink(old) = 0            (2.12)
    Tsink-delay = 1   if Ttest - csink(old) = 1            (2.13)
    Tsink-delay = 0   if Ttest - csink(old) >= 2           (2.14)

The new value of csink(new) is calculated as follows:

    csink(new) = T + Tinst-delay = T + 2   for equation (2.12)
    csink(new) = T + Tinst-delay = T + 1   for equation (2.13)
    csink(new) = T;  Tinst-delay = 0       for equation (2.14)


CASE 2 & 3:

In this case the counters associated with
one or both of the source registers are non-zero. The RAW hazard is
a definite possibility and has to be resolved. The
instruction must necessarily be delayed until the source
dependencies are resolved. Another delay term Tsrc-delay is
introduced in the total delay equation. Tsrc-delay is the
additional delay element in the calculation of Ttest. This
delay term is equal to the non-zero counter value associated
with the source register in case 2 and is equal to the value
computed by equation 2.20 in case 3. Both cases cannot
exist simultaneously. The Tsrc-delay is necessary as the
execution of an instruction will have to be delayed until
the RAW hazards are resolved. The total test time is now
equal to:

    Ttest = T + Tsrc-delay                                 (2.18)

where, in case 2:

    Tsrc-delay = csource-reg + 1                           (2.19)

and in case 3:

    Tsrc-delay = max(csource-reg1, csource-reg2) + 1       (2.20)

The WAW hazard is checked in the same manner as in case 1.
The difference from case 1 is that Tsrc-delay
has to be taken into consideration in deciding Tinst-delay. The
Tinst-delay is calculated similarly to case 1.
If csink(old) > Ttest:

    Tinst-delay = (Tsrc-delay + csink(old)) - (Te - 1)     (2.21)

    csink(new) = T + (Tinst-delay - 1)                     (2.22)
               = Te + Ts + Tsrc-delay + csink(old)
                 - (Te - 1) - 1                            (2.23)
               = Ts + Tsrc-delay + csink(old)              (2.24)
               = csink(old) + Tsrc-delay + 2               (2.25)

If csink(old) <= Ttest, the delays are calculated as follows:

    Tinst-delay = Tsrc-delay + 2   if Ttest - csink(old) = 0    (2.26)
    Tinst-delay = Tsrc-delay + 1   if Ttest - csink(old) = 1    (2.27)
    Tinst-delay = Tsrc-delay       if Ttest - csink(old) >= 2   (2.28)

The new value of csink(new) is calculated as follows:

    csink(new) = T + Tinst-delay = T + Tsrc-delay + 2   for equation (2.26)
    csink(new) = T + Tinst-delay = T + Tsrc-delay + 1   for equation (2.27)
    csink(new) = T + Tinst-delay = T + Tsrc-delay       for equation (2.28)

The equations to resolve the dependencies are summarized
below.

In the absence of RAW and WAW hazards, the expressions for
Ttest and csink(new) are as follows:

    Ttest = T
    csink(new) = T

The values of csource-reg1, csource-reg2 and csink(old) are
all zero.

In the absence of RAW hazards, Ttest and csink(new) are
shown below:

    Ttest = T                                              (2.32)
    Tinst-delay = csink(old) - (Te - 1)
                  if csink(old) > Ttest                    (2.33)
    Tinst-delay is equal to zero, one or two
                  if csink(old) <= Ttest                   (2.34)
    csink(new) = T + (Tinst-delay - 1)   if Tinst-delay > 0
    csource-reg1 = csource-reg2 = 0

The equations for determining the delays to resolve RAW
and WAW hazards are summarized below:

    Ttest = T + Tsrc-delay                                 (2.35)
    Tinst-delay = (Tsrc-delay + csink(old)) - (Te - 1)
                  if csink(old) > Ttest                    (2.36)
    Tinst-delay is Tsrc-delay added with zero, one or two
                  if csink(old) <= Ttest                   (2.37)
    csink(new) = T + Tinst-delay         if Tinst-delay > 0
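The RAW path of these rules can be sketched in Python. This is a minimal sketch for an instruction whose sink counter is zero (no WAW conflict); the helper name and the max-of-counters treatment of case 3 are assumptions, chosen to match the worked example given later in this section (the first multiplication is assigned a 6-cycle delay and its sink counter is set to 14):

```python
def raw_delays(c_src1, c_src2, t_e, t_s):
    """RAW (source-dependency) delay assignment when the sink
    counter is zero, i.e. there is no WAW component to add."""
    t = t_e + t_s                                   # total time T = Te + Ts
    # source delay: largest non-zero source counter plus one
    t_src_delay = max(c_src1, c_src2) + 1 if (c_src1 or c_src2) else 0
    t_inst_delay = t_src_delay                      # no WAW component
    c_sink_new = t + t_inst_delay                   # counter for the sink register
    return t_inst_delay, c_sink_new

# mult r3, r2, r1 issued while c1 = 4 and c2 = 5 (Te = 6, Ts = 2)
print(raw_delays(4, 5, 6, 2))   # -> (6, 14)
# the second mult, issued one cycle later with c1 = 3 and c2 = 4
print(raw_delays(3, 4, 6, 2))   # -> (5, 13)
```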

The process of scheduling the instructions is shown in
Fig. 2.16. The result of the scheduling process is
illustrated in Fig. 2.9. The individual RAW and WAW
components are derived and are also illustrated in Fig.
2.16 for each instruction. The algorithms are based on the
counters that monitor the write process to each register.
It is also necessary for the issue unit to recognize the
capacity in which each register is utilized. This
information is stored in an auxiliary unit which is made
available to the decode and the issue units. This unit is
known as the instruction status unit. The instruction status
unit is a two-dimensional array of fields representing the
decoded instruction. The unit contains four major fields.
The fields are encoded in the following manner: 1) the
opcode field contains the opcode of the present instruction,
2) the execution time field represents the execution time
Te, 3) the R field denotes the utilization of the registers
by the instruction. These registers are the general purpose
system registers that are utilized by the functional units.
They can be used as a source or as a destination register

[Fig. 2.16: The counter values while scheduling the instructions — for
each pipeline cycle (3 through 12) the figure shows the issued
instruction (load R1, (X); load R2, (Y); mult R3, R1, R2; store (Z), R3;
add R3, R1, R2; store (U), R3; load R4, (B); load R5, (D); add R3, R4, R5;
store (V), R3) together with its initial counter values C1-C5, RAW hazard
delay, WAW hazard delay, instruction delay, and updated counter values.]
by the instruction, and 4) the C field represents the time
when the registers will be initialized to the new value by
the instructions using the registers as sink registers. The
R and C fields are further divided into subfields. The
number of subfields in the R field is equal to the number
of subfields in the C field. The R fields are set by the
decoding unit. Every subfield in the C field is a counter.
Each counter is associated with a single register. The
counter subfield c1 represents the time that the register r1
will be initialized to a new value by the most recent
instruction. The subfield c2 represents r2 and so on.
Similarly, every subfield of the R field represents a single
register. The subfields of R are set by the decode unit. The
subfields are set to 1 if the register is used as a source
register, set to 0 if the register is used as a sink
register, and set to 3 if neither is true. The value three
represents don't care. For example, the R1 subfield is set
to 1 by the decode unit if register R1 is used as the
source register by the instruction. The counter fields are
updated by the issue unit. The unit is shown in Fig. 2.17.
The change to Fig. 2.2 is illustrated in Fig. 2.18.
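An instruction status unit entry can be sketched as a small data structure. This is an illustrative sketch: the field names follow the description above, and the five-register size is taken from the examples (a real unit would cover all system registers):

```python
from dataclasses import dataclass, field

SRC, SINK, DONT_CARE = 1, 0, 3   # encodings of the R subfields

@dataclass
class StatusEntry:
    opcode: str                   # opcode of the present instruction
    exec_time: int                # execution time Te
    r: list = field(default_factory=lambda: [DONT_CARE] * 5)  # R1..R5 usage
    c: list = field(default_factory=lambda: [0] * 5)          # counters C1..C5

    def tick(self):
        """Decrement every non-zero counter by one each pipeline cycle."""
        self.c = [max(0, v - 1) for v in self.c]

entry = StatusEntry("mult", 6)
entry.r[0] = SRC; entry.r[1] = SRC; entry.r[2] = SINK   # mult R3, R1, R2
entry.c[2] = 14                                          # sink counter set at issue
entry.tick()
print(entry.c[2])   # -> 13
```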
The issue unit schedules the execution of an
instruction in each pipeline cycle. The execution of the
instruction may be delayed. The delayed instruction must be
stored until it is ready to execute. Two schemes are
possible: 1) hold the instruction in the issue unit and

[Fig. 2.17: Instruction status unit — an entry holds the Opcode (opcode
of the instruction), Exec Time (time required to execute the
instruction), the R field with one subfield per system register (R1-R5),
and the C field with the counters (C1-C5) that keep track of the
registers.]

[Fig. 2.18: The modified pipeline system with the instruction status
unit — the fetch unit feeds the decode unit, the instruction status unit
serves the decode and issue units, and the issue unit feeds the floating
point unit, logic unit, and fixed point arithmetic unit.]
freeze the total PIU until the dependency is resolved and
2) issue the instruction to a buffer provided at the
entrance to the execution units. The former scheme will
reduce the efficiency of the pipeline system. There could
be instructions downstream that can be executed and are not
dependent on the instructions in execution. In our example,
instruction 7 is not dependent on any of the previous
instructions. If the PIU is frozen, instruction 7 will
remain in the fetch unit until the PIU is operational again.
A FIFO queue can be introduced between the units of the PIU
to hold the instructions and keep the fetch unit
functioning. This will create a bottleneck as the issue unit
is still disabled and dynamic scheduling will not be
possible. Thus the effective solution is to adopt the latter
scheme and place buffers at the entrance to the execution
units. The non-executable instructions can remain in these
buffers until they are ready to execute. This will free the
issue unit to issue instructions to the execution unit. The
execution unit will also be able to start execution of
instructions that are issued for immediate execution by the
issue unit. In our example, instructions 5 and 6 can be
placed in the buffers and execution of instruction 7 can
begin during the ninth pipeline cycle. The ideal flow
through the PIU is maintained. The space time diagram for
this scheme is illustrated in Fig. 2.9. The changes in the
structure with relation to Fig. 2.18 are shown in Fig. 2.19.

[Fig. 2.19: The pipeline system with the buffer units — buffer units are
placed between the issue unit and each of the fixed point arithmetic,
logic, and floating point units.]

[Fig. 2.20: Instruction listing to illustrate WAR hazard.]

[Fig. 2.21: Resolving WAR hazard using the counters.]
The WAR hazard arises when the resources are not
distributed to the instructions in the buffer as they become
available. Fig. 2.9 is reproduced in Fig. 2.20 to illustrate
the possibility of the WAR hazard. The WAR hazard will
exist between the instructions mult r3, r1, r2; store (Z),
r3; and add r3, r1, r2. The instructions are highlighted
in a block in Fig. 2.21. The counter values are also shown
along with the instructions. The three instructions are
issued to the buffer. The store instruction must capture the
value of r3 before the add instruction changes the content
of r3. When the store instruction is issued, the counter c3
associated with r3 contains a value of 12. It indicates that
the result of r1 * r2 will be loaded into r3 after 12
pipeline cycles. A pointer is introduced in the buffer
holding the store instruction. This pointer is initialized
to the value of c3 at the time of issue. The pointer counts
down by one in each passing pipeline cycle. The pointer is
independent of c3. At the time that the pointer counts down
to 0, the register r3 will be loaded with the result. This
result can be loaded into the buffer before the instruction
begins execution. Fig. 2.22 illustrates the events. The
buffers in each stage are collectively called a delay
station (DS). Each delay station consists of 10 identical
buffers called delay buffers (DB). Each delay buffer (DB)
holds an instruction until it is ready to execute. Each
delay buffer is further subdivided into nine fields:

[Fig. 2.22: The various events of the scheduling process — each buffered
instruction (load r1, (X); load r2, (Y); mult r3, r1, r2; add r3, r1, r2;
load r4, (B); load r5, (D); mult r3, r4, r5) is shown with the pointer
and counter (c1-c5) that track it across the pipeline cycles.]

[Fig. 2.23: Structure of delay buffers — units 1 through 7 are delay
buffers, each holding Pr# (priority number attached to each unit),
ASR1/ASR2 (addresses of source registers 1 and 2), DSR1/DSR2 (delays of
source registers 1 and 2), SD1/SD2 (source data 1 and 2), ID
(instruction delay), and DR (destination register).]
1) priority number (Pr#), 2) address of source register1 (ASR1),
3) delay of source register1 (DSR1), 4) source data1 (SD1),
5) address of source register2 (ASR2), 6) delay of source
register2 (DSR2), 7) source data2 (SD2), 8) instruction delay
(ID), and 9) destination register (DR). The structure of the
delay buffers is illustrated in Fig. 2.23. The DSR1 field
indicates the number of pipeline cycles (from the present
cycle) required by the source register1 to initialize itself
to the correct value. The same concept applies to the DSR2
field. The ID field indicates the time that the instruction
is allowed to start the process of execution in the
arithmetic or logic unit. The delay fields essentially
decrement by one step in each pipeline cycle. They do not
count down below zero. The delay fields in the buffers are
the pointers that keep track of the source registers. When
the source operand is not available at the time of issue,
the counter value associated with the source register is
loaded into one of the pointers in the buffer. The address
of the source is also loaded into the address fields in the
buffer. If the value of the counter is loaded into DSR1,
then the address of the source register must be loaded into
ASR1. Regularity is maintained. When any of the delay fields
associated with the sources reach zero, the address of that
source is released from the source address field and the
data is latched in the associated source data field. The
data is read from the common data bus that links each

[Fig. 2.24: Connectionist model of delay buffers — registers 1 through N
feed, through a splitter and a 5-to-1 multiplexer, the source data
fields of a delay buffer (Pr#, ASR1/DSR1/Data source 1, ASR2/DSR2/Data
source 2, ID, DR).]

[Fig. 2.25: The pipeline system shown along with the register array —
the register array is linked by the common data bus (thick lines) to the
buffer units in front of the fixed point arithmetic, floating point, and
logic units.]
register to the source data fields in the buffers through
a multiplexer. This multiplexer chooses the data path in
accordance with the source address present in the identification
field. The connectionist model is illustrated in Fig. 2.24.
The changes to the structure are shown in Fig. 2.25.
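A delay buffer's countdown-and-capture behaviour can be sketched as follows. This is an illustrative sketch: the class shape and the register-file dictionary standing in for the common data bus are assumptions, while the field names (ASR1, DSR1, SD1, ID, DR) come from the description above:

```python
class DelayBuffer:
    """One DB entry: counts its delays down each pipeline cycle and
    captures operands from the common data bus as the delays expire."""
    def __init__(self, asr1, dsr1, asr2, dsr2, inst_delay, dest):
        self.asr1, self.dsr1, self.sd1 = asr1, dsr1, None
        self.asr2, self.dsr2, self.sd2 = asr2, dsr2, None
        self.id, self.dr = inst_delay, dest

    def tick(self, registers):
        """Advance one pipeline cycle; `registers` models the data bus."""
        if self.dsr1 > 0:
            self.dsr1 -= 1
            if self.dsr1 == 0:            # delay expired: latch the operand
                self.sd1 = registers[self.asr1]
        if self.dsr2 > 0:
            self.dsr2 -= 1
            if self.dsr2 == 0:
                self.sd2 = registers[self.asr2]
        if self.id > 0:                   # delay fields never go below zero
            self.id -= 1
        return self.id == 0               # ready to execute?

db = DelayBuffer("r1", 1, "r2", 2, 3, "r3")
regs = {"r1": 20, "r2": 30}
for _ in range(3):
    ready = db.tick(regs)
print(db.sd1, db.sd2, ready)   # -> 20 30 True
```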
RESOLVING OPERATIONAL HAZARDS:

This collision hazard occurs when the assigned delays
of two different instructions in the same DS are nullified
in the same pipeline cycle. This hazard also occurs when an
instruction cannot be executed because of latency not being
available. It can be resolved by introducing extra time
delays to all instructions that are in the DS. The
scheduling algorithm in the issue unit assigns time slots
for the execution of each instruction. The time slot
assigned to each instruction in the DS is fixed in time,
with respect to the other instructions.

In case of a

conflict between two instructions, the instruction with the


highest priority is executed and a fixed amount of delay is
introduced to all the instructions in DS. The delay is added
to the existing delays of the source delay counters and the
instruction delay counters. The source delay counters which
have already counted down to 0 are not updated by this
operation. The counters in the instruction status unit are
also updated with the same amount of delay. This ensures
that the relative positions of the time slots for execution
of instructions are not changed. The captured data in the

[Fig. 2.26: Resolving the collision of instructions in the PEU — a
time-space diagram (stages F, D, I, E) of six instructions; an
operational hazard between instructions 5 and 6 is resolved by an
additional delay introduced by the execution unit before instruction 6
enters execution.]

buffers is not lost and the new instructions are scheduled
with the updated counter values. This principle is
illustrated in Fig. 2.26. In simple terms, the execution of
all instructions in the DS is moved en masse in time without
disturbing the order. If the instruction cannot be issued
due to lack of latency, the delay required is equal to the
number of pipeline cycles to the first available latency.
This re-scheduling is carried out independently of the issue
unit. This principle is best illustrated in the example
given below. Consider the instruction set listed below:
    load  r1, 20;
    load  r2, 30;
    mult  r3, r2, r1;
    mult  r4, r2, r1;
    store r3;
    store r4;

The load instruction will be issued in the third


pipeline cycle followed by the second load instruction in
the fourth pipeline cycle. c1 and c2 are set to 6 at the time
of issue. r1 will contain the value of 20 in the ninth
pipeline cycle and r2 will be loaded with 30 during the
tenth pipeline cycle. The first multiplication instruction
will be issued in the fifth pipeline cycle. During this
cycle, c1 will contain the value of 4 and c2 has the value
of 5. The counter c3 associated with the sink register r3
will be set according to equation 2.16. There is no WAW
hazard as c3 is initially equal to zero. The instruction
delay is computed as given in eqn 2.19 and is 6 pipeline
cycles. Thus c3 will be updated to 14. The result of this
instruction will be in r3 at the nineteenth cycle. The
second multiplication instruction is issued next with a
delay of 5 pipeline cycles. The value of c4 is set similarly
to the first instruction and is equal to 13. The events and
the counter values are illustrated in Fig. 2.27. The
counters c1 and c2 are decremented as the event of updating
the registers draws nearer. The first store instruction is
issued during the seventh pipeline cycle. The delay is
computed depending on c3 and is equal to 13 cycles. The
last instruction is issued in the eighth cycle with an
assigned delay of 12 cycles. The state of the instructions
in the pipeline during the cycles 7 and 8 are illustrated
in Fig. 2.28. At the eleventh pipeline cycle, both the
multiplication instructions are ready to be executed. Two
generic instructions cannot be executed from the same stage
at the same time. The first instruction has a higher
priority and was loaded into the DS one cycle ahead of the
second multiplication instruction. As a result, the
execution of the second instruction has to be delayed by one
cycle. This implies that all the instructions that are
dependent on the second instruction will also have to be
delayed by one cycle. This has a recursive effect on the
instructions downstream. Since the issue unit fixes the time
slot for execution, the relative placement of the time slots
between the second multiplication instruction and the

[Fig. 2.27: Sequence of events and the counter values — initial and
updated counter values at pipeline cycles 3 through 6 for load R1, 20;
load R2, 30; mult R3, R2, R1; and mult R4, R2, R1.]

[Fig. 2.28: Sequence of events and the counter values — initial and
updated counter values during pipeline cycles 7 and 8 for the two store
instructions.]

[Fig. 2.29: Updating of the delays due to collision hazard — given the
counter values C1-C5 before updating, and assuming a delay of 'k'
pipeline cycles is needed to resolve the hazard, the counters and the
delay buffers are updated by adding the offset 'k' to all non-zero delay
fields; a unit whose DSR1 is already 0 keeps DSR1 = 0 while, for
example, DSR2 = 3 becomes 3 + k and ID = 7 becomes 7 + k.]

downstream instructions must not be changed. Hence all the
delays are incremented by one. The non-zero source delays
are also incremented in the delay buffers. The value of r3
will remain in the register for one extra cycle more
than the originally scheduled time. The process is illustrated
in Fig. 2.29. In general, the instruction Ij will influence
Ij+1 rather than Ij-1. Hence, this displacement does not affect
the previous instructions. It is evident from our example
that the first two instructions are not inconvenienced by this
displacement. Graphically it is illustrated in Fig. 2.30.
The logic instructions are also treated in the same manner.
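The en-masse shift can be sketched as a single pass over a delay station. This is a minimal sketch; treating the station as a list of field dictionaries and passing the status-unit counters as a list are illustrative assumptions:

```python
def resolve_collision(delay_station, counters, k):
    """Add a k-cycle offset to every pending delay so the relative
    time slots of the buffered instructions are preserved."""
    for db in delay_station:
        for f in ("DSR1", "DSR2", "ID"):
            if db[f] > 0:          # delays already at zero are not updated
                db[f] += k
    # the counters in the instruction status unit get the same offset
    return [c + k if c > 0 else 0 for c in counters]

station = [{"DSR1": 0, "DSR2": 3, "ID": 7}]
print(resolve_collision(station, [0, 5], 1))   # -> [0, 6]
print(station[0])   # -> {'DSR1': 0, 'DSR2': 4, 'ID': 8}
```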
2.3  REDUCING BRANCH PENALTY:

A typical instruction set of any computer consists of
two types of branch instructions: conditional
branch instructions and unconditional branch instructions.
The unconditional branch instruction will initiate a jump
in the current flow. The conditional jump instruction will
initiate a jump only if the evaluated element satisfies the
condition. For example, let a branch
instruction specify a branch to location # 60 only if the
register R5 is equal to zero. The branch will take place
only if the condition is positive, i.e., the register R5 is
equal to zero. The sample instruction set listed above is
now modified to include a conditional branch instruction and
is listed below:
1.  load    r1, (X);      r1 <-- (X)
2.  load    r2, (Y);      r2 <-- (Y)

[Fig. 2.30: The process of capturing the operands and resolving
collisions — for pipeline cycles 6 through 13, the counter values and
the contents of delay buffer units 1 and 2 (Pr#, ASR1/DSR1/SD1,
ASR2/DSR2/SD2, ID, DR) are shown before and after updating; the source
data values 20 and 30 are captured into the buffers as the source delays
expire.]

3.  add     r3, r1, r2;   r3 <-- r1 + r2
4.  store   (Z), r3;      (Z) <-- r3
5.  branchz r3, 100;      branch to 100 if r3 = 0
6.  load    r4, (A);      r4 <-- (A)
7.  load    r5, (B);      r5 <-- (B)
8.  mult    r3, r4, r5;   r3 <-- r4 * r5
9.  store   (C), r3;      (C) <-- r3

Instruction 6 will be executed depending on the
outcome of instruction 5. Instruction 5 will be fetched by
the fetch unit at the beginning of the fifth pipeline cycle.
It will reach the execution unit at the beginning of the eighth
pipeline cycle. It is necessary to stop further issue of new
instructions until the branch instruction is evaluated. The
PIU is frozen until the validity of the branch instruction
is determined. If the branch is positive, then the instruction
at location # 100 will be the next instruction to be issued.
On the other hand, if the result is negative, no branch is
initiated and the next instruction to be issued is
instruction 6. The time from the sixth cycle to the cycle
that the branch instruction is evaluated is wasted and,
furthermore, a few cycles are lost in reconfiguring the
fetch unit. This time can be used to pre-fetch the
instruction from the destination address along with
instruction 6. An additional stream is needed to handle the
second fetch. Hence the fetch unit is extended to feed two
instruction streams. A unit to classify the instruction and
generate the effective address is necessary for the second
stream to become operational. The branch instruction will
not be evaluated until the operand is current. During this

time the activity in the decode and the issue units is


suspended, but the fetch unit can prefetch two instructions
and feed two FIFO queues. These queues can hold the
instructions of the current flow and the instructions
starting from the destination address. The queues would be


best placed in the decode unit. Two program counters are
used to fetch instructions to both the streams. A path
controller is necessary to direct the instruction flow to
the two queues. The system is modified as shown in Fig.
2.31. The current stream is known as the present instruction

counter stream (PIC) and the secondary stream is termed as


the effective address counter stream (EAC). To maintain
symmetry, the system consists of two issue and two decode
units, one for each stream. The EAC stream will become the
current stream when the outcome of the branch instruction
is positive. The PIC queue is flushed up to the issue unit.
The instruction flow is resumed from the EAC queue. The
first instruction is fetched from memory by initializing the
program counter of the PIC stream with the starting address.
Subsequent instructions are fetched in each pipeline cycle
by incrementing the program counter. The current instruction
is examined by the instruction classifier to classify the
type of the instruction. If the instruction belongs to the
class of unconditional branch instructions, the program
counter is updated with the new address and the next
instruction will be fetched from the new location. If the

[Fig 2.31: The complete pipeline system shown with two streams — the
fetch unit feeds the EAC queue and the PIC queue ahead of the decode
unit; the instruction status unit, issue unit, buffer units, and
register array connect (thick lines represent the common data bus) to
the fixed point arithmetic, floating point, and logic units.]
instruction belongs to the class of conditional jump
instructions, the destination address is stored in the
program counter of the EAC stream. The EAC stream is
non-functional until the PIC stream encounters the first
conditional jump instruction. In the ensuing cycles, the EAC
stream fetches instructions starting from the destination
address computed from the jump instruction. The pre-fetched
instructions are stored in the EAC queue which is present
in the decode unit. The outcome of the jump instruction
will determine the condition of the streams. If the jump is
negative, the EAC queue is flushed and the PIC stream
remains the current stream. If the jump is positive, the EAC
stream becomes the current stream and the PIC stream is
flushed. There is no delay because the next instruction is
available in the EAC queue. The EAC stream remains current
as long as no branch instructions are encountered. If a
branch instruction is encountered, the PIC queue will start
filling up with the instructions from the address provided
in the branch instruction. The aforementioned scheme will
operate with a single program counter for each stream when
no multiple jump instructions are encountered in the streams
before the current branch instruction is evaluated. In the
general case there could be multiple jump instructions
encountered by the fetch unit in both the streams while
forwarding the instructions to their respective queues. Even
though the decode unit and the issue unit are disabled by

Fig. 2.32 Sample instructions in memory. PIC stream: instructions starting from address 10 in memory (addresses 10-16), containing Jump (Carry = 0) 23, Jump (Overflow = 0) 33, Jump (Carry = 0) 56, Jump (Overflow = 0) 70, and Jump (Result = 0) 80. EAC stream: instructions starting from address 23 in memory (addresses 23-28), containing Jump (Carry = 0) 28, Jump (Overflow = 0) 36, Jump (Carry = 0) 45, and Jump (Result = 0) 60.


a branch instruction, the fetch unit will remain active
until the queues in the decode unit are filled. Consider the
instructions in memory as listed in Fig. 2.32. Let n1 be the
jump instruction encountered by the PIC stream. The program
counter of the EAC stream is initialized with the
destination address. Two instructions are fetched from the
next cycle onward, one for the PIC stream and the other for
the EAC stream. Branch instructions m1 and n2 are
encountered simultaneously by the streams. The first branch
instruction n1 is currently in the issue unit being
scheduled. Instructions cannot be pre-fetched from the
destination addresses of either m1 or n2. A total of four
streams would be required to prefetch the new set of
instructions. It is not possible to flush any of the streams
as the jump instruction n1 has not been evaluated.

The jump instruction cannot be forwarded to the decode unit
as the decode unit does not have the ability to generate an
effective address. Assuming n jump instructions in the PIC
stream and m jump instructions in the EAC stream have been
identified by the fetch unit before it is disabled, a tree
can be formed to illustrate the possible logical paths. For
example, let m = 4 and n = 5. The tree is formed in Fig.
2.33. The parent node is the current jump instruction that
is being processed. The paths to the left indicate the jump
is valid and the paths to the right indicate that the jump
is invalid. The child nodes are the branch instructions
Fig. 2.33 Graphical representation of the data path due to branch instructions (PC = program counter). The jump instruction n1 is being evaluated in the logic unit; the issue unit and the decode unit have suspended operations until the jump is evaluated.


belonging to both the streams. Starting from the parent
node, the branches to the right or left are deleted as each
node in the path is evaluated. If the jump is valid then the
branches to the right are eliminated along with the child
nodes connected by the branch. Assuming that the branch is
taken by the parent node, the node m1 becomes the parent
node for the first branch in the new path. It is evident
that the next branch instruction that has to be evaluated
is directly dependent on the present branch instruction. It
is not possible to accurately predict the outcome of a
branch instruction until it is evaluated. So when more
branch instructions are encountered in both the streams, the
number of possible program paths is 2^n, where n is the
number of branch instructions. With a single program counter
for each stream, pre-fetch cannot be carried out until one
of the streams is flushed. The destination address would be
lost if the jump instruction were forwarded to the decode
stage unprocessed. The instructions along the same stream
can be accessed by default without changing the stream. The
opposite stream will become the program path when the jump
instruction being evaluated initiates a jump. Hence the
destination addresses of the jump instructions that await
evaluation in the PIC queue must be associated with the EAC
stream. Additional counters which are associated with the
EAC stream record the destination addresses before they are
forwarded to the decode unit. When the jump is

taken, the EAC queue becomes the current queue and the
destination addresses of the branch instructions in the EAC
queue must be associated with the PIC queue. Hence the
destination addresses are held in the additional counters
associated with the PIC stream. This scheme aids the
pipeline system in reducing the branch penalties. In our
example, the destination address of n2 is held in counter 1
of the present EAC stream and the destination address of m2
is held in counter 1 of the present PIC stream. If n1 is
positive then the EAC stream becomes the current stream. The
PIC stream is flushed and pre-fetching can be started in the
next cycle as the address is available in counter 1.
Similarly, if the branch instruction n1 is negative, the
present PIC stream remains the active stream. The EAC stream
and queue are flushed and prefetching starts in the next
pipeline cycle by using the address of n2 in counter 1.
Figures 2.34 to 2.36 illustrate the sequence of operations
assuming that the branch is taken and m1 is the new parent
node. The instructions starting from address #28 are fetched
by the PIC stream. A flow chart depicting the events is
shown in Fig. 2.38. The new fetch unit is illustrated in
Fig. 2.37.
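The stream switching described above can be summarized in a small model (an illustrative Python sketch written for this discussion; the names `Stream`, `on_conditional_jump`, and `resolve_branch` are invented here and are not part of the thesis hardware):

```python
from collections import deque

class Stream:
    """One instruction stream: a program counter plus a FIFO queue."""
    def __init__(self, name):
        self.name = name
        self.pc = None          # program counter; None while the stream is idle
        self.queue = deque()    # FIFO queue held in the decode unit

def on_conditional_jump(other, dest):
    # The destination address of the first pending jump primes the other
    # stream's program counter so that prefetching can begin at once.
    if other.pc is None:
        other.pc = dest

def resolve_branch(taken, pic, eac):
    """Model the flush that follows evaluation of the jump in the logic unit."""
    if taken:
        pic.queue.clear(); pic.pc = None   # PIC queue flushed up to the issue unit
        return eac, pic                    # EAC becomes the current stream
    eac.queue.clear(); eac.pc = None       # jump negative: EAC queue flushed
    return pic, eac

pic, eac = Stream("PIC"), Stream("EAC")
pic.pc = 10                            # first instruction fetched from address 10
on_conditional_jump(eac, dest=23)      # Jump (Carry = 0) 23 encountered
current, other = resolve_branch(taken=True, pic=pic, eac=eac)
```

Running the example with the jump of Fig. 2.32 taken leaves the EAC stream current with its program counter at the destination address 23, while the PIC stream is flushed.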
2.4 HARDWARE SYSTEM:

The pipeline system is designed at the system level
with the individual units of the PIU and the PEU. The
complete system is shown in Fig. 2.39. The individual units

Fig. 2.34 Sample instructions and the possible data paths (PIC stream: instructions starting from address 10 in memory; EAC stream: instructions starting from address 23 in memory, as in Fig. 2.32).

Fig. 2.35 The contents of the counters after fetching the last instruction in both the streams, assuming the jump is being evaluated in the logic unit (EAC and PIC stream counters).


Fig. 2.36 Sequence of updating the counters during the jump operation (PC = program counter). The jump instruction n1 has been evaluated and the branch is taken; the old PIC stream is the redundant stream and hence it is flushed.

Fig. 2.37 The fetch unit (instructions from memory; counter set 1 (PIC/EAC counters); path selector and controller; control paths to disable individual streams; instruction paths to the EAC and PIC queues).


Fig. 2.38 Flow chart for the PIC queue assuming the PIC queue is in session (boxes: fetch instruction; place in PIC queue; PC(PIC) <-- PC + 1; if the EAC stream is in session, load the EA into the first available empty counter of the EAC stream, otherwise load the EA into the program counter of the EAC stream and start the EAC stream).

Fig. 2.38 The flow chart of the PIC queue assuming the PIC queue is in session (cont'd) (boxes: clear the program counter and the associated counters 1-n of the EAC field; load the program counter with the contents of counter 1; move the contents of the counters one counter to the left).

Fig. 2.38 Flow chart for the EAC queue assuming the EAC queue is in session (cont'd) (boxes: fetch instruction; place in EAC queue; PC(EAC) <-- PC + 1; if the PIC stream is in session, load the EA into the first available empty counter of the PIC stream, otherwise load the EA into the program counter of the PIC stream and start the PIC stream).

Fig. 2.38 The flow chart of the EAC queue assuming the EAC queue is in session (cont'd) (boxes: are the counters full; clear the program counter and the associated counters 1-n of the PIC field; load the program counter with the contents of counter 1; move the contents of the counters one counter to the left).

Fig. 2.39 The proposed look-ahead pipeline computer system (with the main memory module).


are provided with local controllers which are responsible
for the functioning of each unit. The local controllers can
communicate with each other. The system contains five
general purpose registers: R1, R2, R3, R4, and R5. Data to
and from these registers are transferred over the common
data bus. Each register is associated with a program status
register that represents the condition of the value in the
register. The instructions enter the pipeline system through
the fetch unit. The address of the instruction to be fetched
is issued by a counter, referred to as the program counter,
present in the fetch unit. The opcode of the newly fetched
instruction is checked to determine whether the instruction
is a branch instruction. A non-branch instruction is
passed unchanged to the next stage. A branch instruction
is further classified as a conditional or an unconditional
branch instruction. For an unconditional branch instruction,
the program counter is updated with the destination address
from where the instructions are fetched in the pipeline
cycles that follow. The handling of conditional branch
instructions is explained in section 2.4.1. The fetch unit
can fetch two instructions simultaneously to reduce the
branch overheads. The individual data paths of the fetch
unit are termed instruction streams. The current
operational instruction stream is determined by the logic
unit. The switching of streams is carried out by a path
controller in the fetch unit. The control information from


the logic unit is fed to this unit, which in turn determines
the current stream. The instruction is forwarded to the
decode stage. The decode unit consists of a local FIFO queue
and an instruction decoder for each of the two streams of
the fetch unit. The instruction first enters the FIFO queue
and then reaches the decoder. The current operational stream
is determined by the logic unit and is the same as in the
fetch unit. The individual streams of the fetch unit are
disabled if the corresponding queues in the decode unit are
filled. The decode unit splits the instruction into its
fundamental components, namely the source operands, the
destination operands, and the operation involved. This
information is recorded in the instruction status unit,
which is a part of the decode unit and is common to both the
streams. The function of this unit is to supply information
to other units about the specification of the present
instruction. The instruction status unit is a two
dimensional array that records the past and the current
history of instructions executed in the pipeline system. The
decode unit is controlled by the logic unit and is disabled
when a branch instruction is being evaluated in the
execution unit. The unit is explained in section 2.4.2. The
decoded instruction is forwarded to the issue unit after all
the relevant information about the instruction is recorded
in the instruction status unit. The issue unit checks the
instruction status unit to determine dependencies between
the current instruction and the


instructions that have been issued to the execution unit.
The issue unit schedules the execution time of the
instruction to resolve the hazards. The delay time is
calculated from the information provided by the instruction
status unit. The instruction is set into a certain format
and sent to the execution units. The issue unit is described
in section 2.4.3. The delayed instructions are held in
buffers until the hazards are resolved. The delay stations
monitor the registers to capture the missing operands as
they become available. Data transfer to and from the
registers is carried over the common data bus. The
arithmetic instructions are executed by the arithmetic units
and the logic instructions are executed by the logic unit.
These units are also provided with controllers that monitor
the units to resolve structural hazards. The instructions
are initiated into their units when the appointed time slot
has arrived. The branch instruction is held in the logic
unit until it is resolved. During this time the issue unit,
along with the decode unit, is disabled. The fetch unit is
not dependent on the execution unit but is dependent on the
condition of the queues in the decode unit. The controllers
of the various units communicate with each other via the
common system control register, which has fields associated
with each unit. These fields are write-only for the
designated unit and read-only for the remaining units. The
total system is illustrated in Fig. 2.40. The individual

Fig. 2.40 The overview of the complete system (common data bus; main memory unit; fetch unit with counter sets 1 and 2, the opcode classifier and EA generator, and the path selector and controller; decode units with the PIC and EAC queues, decoder units 1 and 2, and the instruction status unit; issue units 1 and 2; the logic unit controller; registers R1 to R5; execution units).


units are described in the following sections.


2.4.1 FETCH UNIT:

The fetch unit comprises the logic to fetch
instructions from memory, two sets of counters, an opcode
classifier, an effective address (EA) generator, and the
path controller. The path controller is also the local
controller for the fetch unit. The function of the opcode
classifier is to determine the type of the present
instruction. The fetch unit is capable of fetching two
instructions simultaneously. This is done to reduce the
branch overheads. Each set of counters consists of 10
individual counters, identified as counter 0 to counter 9.
The counters referred to as counter 0 are used as program
counters in the individual sets. Each set supports a single
instruction stream. The PIC stream starts as the current
instruction stream. The instruction streams end in two FIFO
queues in the next stage. The unit is illustrated in Fig.
2.41. The counters of each stream are initialized by the
instructions passing through the opposite stream. The
counters 1 to 9 are filled with the destination addresses
held by the branch instructions that are awaiting execution
in the FIFO queues. The fetch unit is disabled if the queues
are filled with instructions. Individual streams are
disabled once the associated queue is full. Branch
instructions are held at the issue unit of the current
stream until the outcome is finalized. The decode unit and

Fig. 2.41 The fetch unit (instructions from memory; counter set 2 (EAC/PIC counters); path selector and controller; A: control signals for path information; C: control signals to disable individual streams).


the issue unit are disabled, but the queue is still filled
with instructions. The instructions of both the streams are
classified by the classifier and the various conditional
branch instructions are identified. When a branch
instruction is encountered in the PIC stream, the
destination address is calculated and placed in a counter
belonging to the EAC stream. The appropriate counter is
determined by the number of branch instructions that are
present between the present instruction in the fetch unit
and the branch instruction that is being currently evaluated
in the logic unit. Thus the counter 1 of the EAC stream is
initialized with the address of the first branch instruction
in the PIC queue, with respect to the branch instruction
that is in the logic unit. At any instant of time there will
only be a single branch instruction being evaluated in the
logic unit. Counter 2 (EAC stream) is loaded with the
destination address of the second branch instruction in the
PIC stream, and so on. In general, the destination addresses
of the branch instructions are loaded into the counters of
the EAC stream in the same order as their physical presence
in the PIC stream. This allows the EAC stream to store all
the possible destination addresses. In the event of the
current branch instruction not being valid, the EAC queue
is flushed and the counter 0 is loaded with the value in
counter 1, along with the other addresses moving up one
counter to the left. This is shown in Fig. 2.42. The

Fig. 2.42 The counters associated with each queue (instructions from memory; path selector and controller; control paths; A: control signals for path information; C: control signals to disable individual streams).


same procedure is followed by the EAC stream by loading the
counters 1 to 9 in the PIC stream. If the branch is taken,
the EAC stream becomes the current stream and the PIC queue
is flushed along with the contents of the counters 1 to 9
of the EAC stream. The counter 0 of the PIC stream is loaded
with the address stored in counter 1, and the PIC stream
starts fetching instructions from the new address. The
other addresses are also moved up by one counter to the
left. The address present in counter 0 corresponds to the
branch instruction in the EAC stream that is being currently
evaluated, or the first jump instruction that will be
evaluated by the EAC stream. The path selector holds the
identity information and is responsible for the loading of
addresses into the counters. The path selector monitors the
decode queue and disables the fetch unit or the individual
streams as necessary. The external control signals that are
needed by the fetch unit are: change path, disable EAC
queue, and disable PIC queue.
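The counter discipline of Fig. 2.42, loading destination addresses into the first empty counter and shifting the whole set one counter to the left when a branch resolves, can be sketched as follows (illustrative Python, not the hardware; the list `counters` stands for counter 0 through counter 9 of one stream, with counter 0 acting as the program counter):

```python
def record_destination(counters, dest):
    """Load a pending jump's destination into the first empty counter (1..9)."""
    for i in range(1, len(counters)):
        if counters[i] is None:
            counters[i] = dest
            return i
    raise RuntimeError("all counters full; the stream must stall")

def branch_resolved(counters):
    """Counter 0 takes the value of counter 1 and the rest move one to the left."""
    counters[0] = counters[1]
    for i in range(1, len(counters) - 1):
        counters[i] = counters[i + 1]
    counters[-1] = None

counters = [None] * 10            # counter 0 is the program counter of the stream
record_destination(counters, 28)  # destination of the first pending jump
record_destination(counters, 36)  # destination of the second pending jump
branch_resolved(counters)         # prefetch restarts from address 28
```

After the shift, counter 0 holds the address of the jump that will be evaluated next, exactly as the text requires.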


2.4.2 DECODE UNIT:

The decode unit decodes the instruction and identifies
the sink and the source operands. The decode unit consists
of two instruction queues and two decoder units. Instruction
queue 1 is designated as the PIC queue and instruction queue
2 is designated as the EAC queue. The queues are FIFO in
nature. Two different instructions belonging to two
different streams can be simultaneously decoded by their
Op-code  Exec Time  R1  R2  R3  R4  R5  C1  C2  C3  C4  C5
add      6          0   1   1   3   3

Op-code: the opcode of the instruction.
Time: the time required to execute the instruction.
R-field: the fields representing the registers in the system.
C-field: counter fields associated with the registers.
0: register used as the sink (destination) register.
1: register used as the source register.
3: not used in the instruction under consideration.

Fig. 2.43 Instruction status unit.


individual decoders. The decoded information is stored in
the instruction status unit. When the instruction status
unit is filled up, the current decoded information is stored
at the beginning of the array. In this manner the old
records are overwritten with the new ones. The roll over is
necessary so as to limit the size of the unit. Both the
streams use the same unit. For example, let the instruction
read from the PIC queue be R1 = R2 + R3. In machine language
mnemonics it would be stated as ADD R1,R2,R3. R1 is the
destination or sink register. R2 and R3 are the source
registers. The operation specified is ADD. The decoded
instruction in the instruction status unit would read as in
Fig. 2.43. The digit 0 represents that the associated
register is used as a sink register. The digit 1 indicates
the source registers. The digit 3 is used as a null
variable. The decode unit is controlled by a queue
controller that monitors and controls the FIFO queues. The
controller is assigned the duty of determining whether a
particular queue is full or in operation, and of flushing
the redundant queue. The control signals flush queue one and
flush queue two are needed by the controller to flush the
queues. The controller puts out a control signal which
indicates the queues that are full. The unit is illustrated
in Fig. 2.44.
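The record format of Fig. 2.43 can be sketched as below (illustrative Python written for this discussion; the encoding 0 = sink, 1 = source, 3 = unused follows the figure, and the execution time of 6 for ADD is taken from the figure's sample row):

```python
SINK, SOURCE, UNUSED = 0, 1, 3

def decode_to_status(opcode, sink, sources, exec_time,
                     registers=("R1", "R2", "R3", "R4", "R5")):
    """Build one row of the instruction status unit for a decoded instruction."""
    row = {"opcode": opcode, "time": exec_time}
    for r in registers:
        row[r] = SINK if r == sink else SOURCE if r in sources else UNUSED
    for r in registers:
        # C fields: counters associated with each register, later set by the
        # issue unit; zero means no outstanding delay on that register.
        row["C" + r[1:]] = 0
    return row

row = decode_to_status("ADD", sink="R1", sources=("R2", "R3"), exec_time=6)
```

The row for ADD R1,R2,R3 then reads 0 for R1, 1 for R2 and R3, and 3 for the unused registers, matching Fig. 2.43.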
2.4.3 ISSUE UNIT:

The issue unit issues the instruction to the execution

Fig. 2.44 Decode unit (flush queue 1 and flush queue 2 control signals; instructions from the fetch unit; decoded instructions to issue units 1 and 2; thick lines represent data lines, thin lines are the control lines).

Fig. 2.45 Instruction format issued to the execution unit (fields: Opcode, ASR1, DSR1, SD1, ASR2, DSR2, SD2, DR).


units. The function of the issue unit is to schedule the
execution of the instructions. Each stream contains its own
issue unit. At any instant of time, the operating issue unit
belongs to the stream that is current. The issue unit
controls the C field in the instruction status unit, and the
delay is set according to the information that is available.
The issue unit consists of: 1) logic capable of resolving
the RAW and WAW hazards, and 2) the instruction router unit.
The issued instruction is formatted as shown in Fig. 2.45
and is forwarded to the execution unit. The output of the
issue unit is made up of eight fields: 1) address of source
register one (ASR1), 2) operand of source register one
(DSR1), 3) source delay one (SD1), 4) address of source
register two (ASR2), 5) data of source register two (DSR2),
6) source delay two (SD2), 7) instruction delay (ID), and
8) destination register (DR). The instruction is fetched
from the current queue in the decode unit and is
simultaneously fed to the main hazard resolving unit. The
various delays are computed and the instruction is formatted
to be issued to the execution unit in the next pipeline
cycle. If the operands are available, they are loaded into
the operand data fields and then issued. For example, let
ADD R1,R2,R3 be the present instruction encountered by the
issue unit. If C1, C2, C3 are all zeros and R1=R2=R3=5, then
the formatted instruction would read as displayed in Fig.
2.46. On the other hand, if C1, C2, C3 are non-zero, the

R1 = R2 + R3, where R2 and R3 are available. Let R2 = 5 and R3 = 5.

Opcode  ASR1  DSR1  SD1  ASR2  DSR2  SD2  DR
ADD     R2    5     0    R3    5     0    R1

Fig. 2.46 Formatted instruction for 'add R1,R2,R3' with no delay

R1 = R2 + R3, where R2 and R3 are not available. The delay associated with R2 and R3 is 3 and 4 respectively.

Opcode  ASR1  DSR1  SD1  ASR2  DSR2  SD2  DR
ADD     R2    -     3    R3    -     4    R1

Fig. 2.47 Formatted instruction with delay, forwarded to the execution unit

Fig. 2.48 Issue unit (instructions from the EAC and PIC queues; instructions to the execution unit; N: update counter fields in the system status unit; M: input of the counter fields from the system status unit; U: common data bus; V: disable issue unit signal from the logic unit controller; thick lines represent data flow, thin lines are the control lines).


delays have to be computed, and such an instruction would be
forwarded to the execution unit as shown in Fig. 2.47.
Conditional branch instructions are handled in a different
manner. The issue unit calculates the time when the correct
result will be available and forwards it to the logic unit.
The issue unit is then disabled along with the decode unit
until the branch instruction is evaluated. The issue unit is
illustrated in Fig. 2.48.
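The formatting rule behind Figs. 2.46 and 2.47 can be sketched as follows (illustrative Python; the field names follow Fig. 2.45, while the function name and the dictionaries `regs` and `counters` are invented for the example):

```python
def format_instruction(opcode, src1, src2, dest, regs, counters):
    """Build the issued-instruction fields; a non-zero counter means the
    operand is not yet available, so a source delay is attached instead."""
    inst = {"opcode": opcode, "ASR1": src1, "ASR2": src2, "DR": dest,
            "DSR1": None, "DSR2": None, "SD1": 0, "SD2": 0}
    if counters[src1] == 0:
        inst["DSR1"] = regs[src1]       # operand available: load the data field
    else:
        inst["SD1"] = counters[src1]    # otherwise record the source delay
    if counters[src2] == 0:
        inst["DSR2"] = regs[src2]
    else:
        inst["SD2"] = counters[src2]
    return inst

regs = {"R2": 5, "R3": 5}
no_delay = format_instruction("ADD", "R2", "R3", "R1", regs, {"R2": 0, "R3": 0})
delayed  = format_instruction("ADD", "R2", "R3", "R1", regs, {"R2": 3, "R3": 4})
```

Here `no_delay` matches Fig. 2.46, with the operand values loaded and zero source delays, while `delayed` matches Fig. 2.47, with source delays of 3 and 4 attached in place of the missing operands.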
2.4.4 EXECUTION UNIT:

The execution unit comprises three sub-units, namely:

1) dynamic fixed point arithmetic unit, 2) dynamic floating

point arithmetic unit, and 3) logic unit. The fixed point


arithmetic unit is a pipelined unit with seven stages. The
design is based on the carry save adder tree for multiple
additions of binary numbers. The behavior of the arithmetic
unit is dynamic and it c a n execute four different
operations:

add, subtract, multiply and divide without

reconfiguring the pipeline. It can also handle upto three


different arithmetic operations being processed in the
various stages in the same pipeline cycle. The arithmetic
instructions whose operations are multiplication or division
are allowed to enter the arithmetic unit at stage one.
Addition and subtraction instructions are introduced to the
pipeline at stage six. The results are uploaded to
destination registers at stage 7. Arithmetic instructions
issued by the issue unit that do not contain the appropriate

Fig. 2.49 Dynamic pipelined execution unit (instructions from the issue unit; the dynamic fixed point arithmetic unit, dynamic floating point unit, and logic unit, each with its own controller; DS: delay station; thin lines: control signals to control the input and output; thick lines: instruction flow from the issue unit; very thick lines: output data lines to the common data bus).


operands are held at stage 1 or stage 6, depending on the
operation specified by the instruction. The floating point
unit is a repetition of the fixed point unit with some
external combinational circuitry to take care of the
additional processing. The logic unit is responsible for the
execution of logic instructions and the evaluation of branch
instructions. Branch instructions that have to be evaluated
are held at the DS provided in the logic unit. Additional
memory elements are provided in stages one and six of the
arithmetic units and stage three of the logic unit. These
memory elements store the instructions that are issued by
the issue unit until they are ready to be executed. The
execution units are illustrated in Fig. 2.49. Ten buffers
are available at the entrances to the execution unit. Every
DB has equal access to the execution unit. Priority numbers
are assigned to every DB in the reservation station. The
priority number is used to determine the instruction that
has been waiting for the longest time in a delay station. In
case of a conflict between two incompatible instructions
that require the use of the same stage of the execution
unit, the instruction with the highest priority is executed.
The priority numbers are daisy chained as instructions are
executed. Each delay station operates independently of the
others in the system. The individual execution units are
provided with dedicated controllers which provide collision
free execution of instructions in the execution units. The
controllers of


the arithmetic units provide collision free execution of
instructions. The logic unit controller is responsible for
evaluating the pending branch instructions and directing the
instructions into their respective buffers. The controllers
communicate with each other by using a common control
register. The detailed design of the controllers is beyond
the scope of this research.
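The entry-point rule (multiply and divide enter at stage one, add and subtract at stage six) together with the waiting-time priority can be modelled as below. This is an illustrative Python sketch, not the controller design, which the thesis leaves out of scope; in particular, the daisy-chain rule is read here as every waiting instruction moving up one priority step when an instruction is started, which is one plausible interpretation.

```python
ENTRY_STAGE = {"MUL": 1, "DIV": 1, "ADD": 6, "SUB": 6}

def select_for_stage(stage, waiting):
    """From the buffered instructions that enter at this stage, start the
    one with the highest priority number (the longest-waiting entry)."""
    candidates = [w for w in waiting if ENTRY_STAGE[w["op"]] == stage]
    if not candidates:
        return None
    winner = max(candidates, key=lambda w: w["priority"])
    waiting.remove(winner)
    for w in waiting:                  # daisy-chain: the others move up one
        w["priority"] += 1
    return winner

waiting = [{"op": "ADD", "priority": 2},
           {"op": "SUB", "priority": 5},
           {"op": "MUL", "priority": 1}]
first = select_for_stage(6, waiting)   # conflict at stage six: SUB has waited longest
```

In the example the ADD and SUB compete for stage six; the SUB wins because it carries the higher priority, and the remaining entries each advance one step.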
The next chapter deals with the structure and design
of the fixed point arithmetic unit.

CHAPTER THREE
DESIGN OF DYNAMIC PIPELINE ARITHMETIC UNIT
3.1 INTRODUCTION:

The arithmetic functions are executed by the arithmetic
units in the execution unit. The design of these units
determines the throughput of the total system. The
arithmetic units are modelled after the static Wallace tree
structure capable of performing multiple number additions.
The advantage is that the architecture is pipelined, and
modifications to increase the computing capabilities are
possible. The static unifunction pipeline has to be
converted to a multifunction pipeline capable of handling
addition, subtraction, multiplication and division.
Individual functional units can be provided, but this leads
to an increase in, and redundancy of, the hardware. The
design criterion is to model a single arithmetic unit which
is capable of carrying out the execution of different
arithmetic instructions at the same time. The algorithms for
performing the arithmetic operations are chosen so as to
complement the structure of the Wallace tree.
The Wallace tree is first described and the
modifications are carried out depending on each of the four
operations.
3.2 PRINCIPLE OF OPERATION OF THE CSA TREE:

Multiple number addition can be realized with a
multilevel tree adder. The conventional carry propagate
adder (CPA) adds two inputs and yields a single output. A
carry save adder (CSA) receives three inputs and produces
two outputs called the sum vector (SV) and the carry vector
(CV). The CSA is a full adder wherein the carry-in bit is
treated as an element of the third input vector. The
carry-out bit is treated as an output element of the carry
vector and is not forwarded to the next full adder. The sum
vector and the carry vector are treated as individual
vectors and are operated upon in the same manner.

An n-bit CSA element consists of n full adders wherein
the carry-in bits of the individual adders are used to enter
the third vector and all the carry-out terminals act as the
output lines of the carry vector. The carry lines are not
internally interconnected in a carry save adder. The truth
table of a single CSA element is illustrated in Fig. 3.1
along with the CSA element.
Mathematically the carry save adder is represented as:

A + B + D = S + C          (3.1)

where + is the arithmetic addition operation. The input
vectors are A, B, and D. The output vectors are S (the sum
vector) and C (the carry vector). The total sum of the three
input vectors is obtained by adding the S vector to the C
vector. The carry vector is shifted one bit to the left
compared to the sum vector. This shifting of the carry bit
is necessary to maintain the correct placement of the
vectors with respect to each other.

Fig. 3.1 The CSA element and its truth table (carry save adder unit with sum and carry outputs).

Fig. 3.2 CSA operation of adding three elements.

In the process of summation, the carry bit of the lowest
significant bit is added along with the next higher order
bits. This principle is illustrated in Fig. 3.2. If it is
necessary to perform multiple number additions, then the
individual CSA elements are configured into stages of a
pipeline. The process of adding n vectors, m bits long, is
carried out as follows. The input binary vectors are divided
into k groups consisting of three vectors each. If the
number of vectors is not a multiple of three then the value
of k is equal to the highest number of groups that are
possible. The number of CSA elements required to start the
process is equal to k m-bit CSA units. The CSA elements
which perform the operations in parallel are grouped
together into a stage or a level. The ungrouped vectors are
passed undisturbed to the next stage. The aim is to merge
the n input vectors into two vectors S and C, each 2*m bits
long. The process of merging is carried out in stages. The
number of CSA elements in a stage is equal to the highest
number of three-vector groups that are possible from the
input vectors to that stage. The ungrouped vectors are
passed on to be processed in the next stage. The final
result is obtained by adding the last sum and carry vectors.
The relative order of the vectors has to be maintained
throughout the pipe so as to obtain the correct
result.
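The merging process described above can be sketched on Python integers, where equation (3.1) becomes a + b + d == s + c and the carry vector is shifted one bit to the left (an illustrative sketch written for this discussion, not the hardware design; the final addition stands in for the concluding CPA step):

```python
def csa(a, b, d):
    """Carry save adder: three input vectors in, sum and carry vectors out.
    The carry vector is shifted one bit to the left before further use."""
    s = a ^ b ^ d                            # bitwise sum, no carry propagation
    c = ((a & b) | (a & d) | (b & d)) << 1   # carry out of each bit position
    return s, c

def csa_tree(vectors):
    """Merge n vectors into two by repeated 3-to-2 reduction, level by level;
    ungrouped vectors pass undisturbed to the next level."""
    while len(vectors) > 2:
        nxt = []
        for i in range(0, len(vectors) - 2, 3):   # k three-vector groups
            nxt.extend(csa(*vectors[i:i + 3]))
        nxt.extend(vectors[len(vectors) - len(vectors) % 3:])  # leftovers
        vectors = nxt
    return sum(vectors)  # final CPA step: add the last sum and carry vectors
```

Because Python integers behave as arbitrarily long bit vectors, the relative displacement of the carry vector is captured by the single left shift, and the result equals the ordinary sum of the inputs.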
Let the eight vectors shown below be the shifted
multiplicands of two eight bit binary vectors wherein the
operation between them is multiplication. These partial
products are to be added to obtain the final product and
hence involve multiple additions. The leading and trailing
zeros are added to show the relative displacement of the
vectors with respect to each other.
W1    0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1
W2    0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 0
W3    0 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0
W4    0 0 0 0 0 1 1 0 0 1 1 0 0 0 0 0
W5    0 0 0 0 1 1 0 1 0 0 1 1 0 0 0 0
W6    0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0
W7    0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0
W8    0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0

W1 to W8 are the partial products of the binary
multiplication of two vectors, and in this example they are
treated as 16 bit binary vectors.
If we restrict the operation of the CSA tree to adding
multiple numbers, a Wallace tree can be structured. In
general, a v-level Wallace tree can add up to N(v) input
numbers, where N(v) is evaluated by the following recursive
formula [23]:

    N(v) = 1.5 * N(v-1) - 0.5 * (N(v-1) mod 2)        (3.2)

with N(1) = 3.
For example, we need 10 CSA tree levels to add 64 to 94
numbers in one pass through the tree.
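The recursion is easy to check numerically. The sketch below (the function names are my own) evaluates N(v) and the smallest tree depth whose capacity covers a given operand count, assuming N(1) = 3.

```python
# Capacity N(v) of a v-level Wallace (CSA) tree, per the recursive formula
# N(v) = 1.5*N(v-1) - 0.5*(N(v-1) mod 2), with N(1) = 3.

def wallace_capacity(v):
    n = 3
    for _ in range(v - 1):
        n += n // 2  # integer form of 1.5*n - 0.5*(n % 2)
    return n

def levels_needed(count):
    """Smallest v with wallace_capacity(v) >= count."""
    v = 1
    while wallace_capacity(v) < count:
        v += 1
    return v

print([wallace_capacity(v) for v in range(1, 11)])
# -> [3, 4, 6, 9, 13, 19, 28, 42, 63, 94]
print(levels_needed(64))  # -> 10, matching the 64-to-94 range in the text
```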

Mathematically, for eight of the eight bit vectors we need a
five level CSA tree. The leading zeros are omitted for the
calculations. The process of addition is illustrated below,
wherein SV represents the sum vector and CV represents the
carry vector. The number of groups of three vectors that can
be formed is two (k=2), and hence two eight bit CSA units
are required to start the process.
Level 1:

The following is the operation in eight bit CSA unit #1:

The following is the operation in eight bit CSA unit #2:

At the end of level one the results are tabulated in their
correct order:

These vectors are forwarded to level 2 for further
processing. At level two there are six binary vectors to be
added. The number of three vector groups that are possible
is two (k=2). They are 1) SV1, CV1, SV2 and 2) CV2, W7, W8.
The operation of CSA units three and four is as follows:
Level 2:

CSA unit number #3:

CSA unit number #4:

The results of level two are tabulated below:

These results are forwarded to level three.

In level three, the three inputs SV3, CV3, and SV4 are
converted into two outputs which are then used to compute
the result. The operation of CSA unit five in level 3 is as
follows:

The results of level 3 are tabulated below:

The above results are forwarded to level 4 along with CV4.

Level 4:

The above vectors are forwarded to level 5 for the final
summation.

Level 5:

SUM

1 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1

The complete structure of the pipelined unit is shown in
Fig. 3.3. Anderson et al. [22] have used this concept and
modified it to suit the needs of the IBM System 360/Model 91.


3.3  CONVERSION OF UNIFUNCTION STRUCTURE TO MULTIFUNCTION STRUCTURE:

The structure discussed is a static unifunction


pipeline. It can carry out only the additions of multiple

Fig. 3.3  CSA structure to add eight, 8-bit binary vectors

numbers which are N bits wide. The main aim of the design
is to modify the static structure to support the operations
of addition, subtraction, multiplication and division.
3.3.1  MODIFICATIONS DUE TO ADDITION AND SUBTRACTION:


The last stage of the CSA tree structure can support
addition and subtraction. The addition can be carried out in
the adder unit which sums up the two final vectors from
stage 5. Hence a path has to be created to load the two
vectors from the external sources. A multiplexer is
introduced to choose between the two streams. The changes
are illustrated in Fig. 3.4. The subtraction is carried out
using the two's complement method, which means the number to
be subtracted has to be inverted on demand and a value of
one added to the least significant binary digit. The
operation of inversion on demand can be achieved by using
XOR gates and controlling one of the inputs. Hence an XOR
gate array is attached to one of the branches of the
external input data stream as shown in Fig. 3.5.
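Functionally, the XOR array plus carry-in behaves as below. This is a behavioral sketch assuming an 8-bit word; the names `add_sub`, `select`, and `carry_in` are mine, not signals from the design.

```python
# Two's complement subtraction via an XOR gate array: when subtracting,
# every bit of the second operand is inverted (XOR with all-ones) and a
# carry-in of 1 supplies the final +1 of the two's complement.

WIDTH = 8
MASK = (1 << WIDTH) - 1

def add_sub(a, b, subtract):
    select = MASK if subtract else 0  # drives one input of each XOR gate
    carry_in = 1 if subtract else 0
    return (a + (b ^ select) + carry_in) & MASK

print(add_sub(100, 37, subtract=True))   # -> 63
print(add_sub(100, 37, subtract=False))  # -> 137
```

The same adder path thus serves both operations, with only the select line and the carry-in differing.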
3.3.2  MULTIPLICATION:

The operation of multiplication that is being attempted is
very similar to that of decimal multiplication. When two
binary vectors
of lengths m and n are to be multiplied, the final product
will be a vector with the maximum length of (m + n). Let a
vector A of length n be multiplied with another vector B of
length m. Each member of the vector B namely bj (for all

Fig. 3.4.  Changes to the pipeline due to addition.

Fig. 3.5.  Changes to the pipeline due to subtraction.

j=0,m) is multiplied with each member of vector A, namely ai
(for all i = 0,n), to produce m such vectors called the
partial product vectors. The process is illustrated in Fig.
3.6 for m equal to 6 and n equal to 6.

P = { P10 P9 P8 P7 P6 P5 P4 P3 P2 P1 P0 } = Total product.

Fig. 3.6.  Multiplication by multiple additions.

The values of m and n were chosen to be 6, and hence the
product vector has eleven elements. This process requires
six shifts and six adds to get the product. The partial
products are necessary to compute the result and they must
be shifted according to the weight of their multiplier bit
bj.
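The shifted multiplicand generation can be sketched as follows; `shifted_multiplicands` is a hypothetical helper name, and integers stand in for the bit vectors.

```python
# Each bit b_j of the multiplier B contributes a partial product equal to
# the multiplicand A shifted j places (or zero when b_j is 0); the product
# is the multiple-number sum of these vectors, as fed to the CSA tree.

def shifted_multiplicands(a, b):
    return [(a << j) if (b >> j) & 1 else 0 for j in range(b.bit_length())]

a, b = 0b101101, 0b110110  # two 6-bit vectors, as in Fig. 3.6
parts = shifted_multiplicands(a, b)
print(sum(parts) == a * b)  # -> True
```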

Fig. 3.7  Changes to the pipeline by multiplication
(shifted multiplicand generator producing W1 to W8)
3.3.3  MODIFICATIONS DUE TO MULTIPLICATION:

A stage is added at the top of the pipe to calculate the
shifted multiplicands of the input binary vectors. This
stage calculates the partial products of the two binary
vectors that are to be multiplied and presents them as
multiple vectors. The new stage is shown in Fig. 3.7.
3.3.4  DIVISION:
The division process is different from the usual shift and
subtract operation. The principle is based on converting the
shift and subtract into a shift and add operation. In
simpler terms, the division operation is converted into a
multiplication operation.

This operation is called the convergence method of division
and it is used in the IBM 360/370 and CDC 6600/7600. The
method is described briefly below. We want to compute the
ratio (quotient) Q = N/D, where N is the numerator and D is
the denominator. This process is carried out in normalized
binary arithmetic form. Hence (0.5 <= N < 1) and (0.5 <= D
< 1). In the original method N is always less than D, but
this has been modified to accommodate the case N > D as
well. The only restriction placed on this method is that
both N and D must be normalized before any of the operations
can begin.
Let Ri for i = 1, 2, 3, ..., k be the successive converging
factors. One can select

    Ri = 1 + δ^(2^(i-1))        for i = 1, 2, ..., k        (3.3)

where δ = 1 - D and 0 < δ <= 0.5.

To evaluate the quotient Q we multiply both N and D by Ri,
starting from i = 1 until a certain stage, say k.
Mathematically, we have:

    Q = N/D = (N x R1 x R2 x ... x Rk) / (D x R1 x R2 x ... x Rk)

The value of the denominator D is substituted with 1 - δ and
the resulting equation is shown below:

    Q = (N x R1 x R2 x ... x Rk) / ((1-δ) x R1 x R2 x ... x Rk)

Expanding Ri in terms of (1 + δ^(2^(i-1))) as in equation
(3.3) for i = 1, 2, 3, ..., k, the above equation is
modified as given below:

    Q = (N x (1+δ) x (1+δ^2) x (1+δ^4) x ... x (1+δ^(2^(k-1))))
        / ((1-δ) x (1+δ) x (1+δ^2) x ... x (1+δ^(2^(k-1))))

The denominator can be reduced to one term as shown in the
following:

    Q = (N x (1+δ) x (1+δ^2) x (1+δ^4) x ... x (1+δ^(2^(k-1))))
        / (1 - δ^(2^k))

The value of δ cannot exceed 0.5. Hence, the denominator
term (1 - δ^(2^k)) will tend towards unity when the value of
k is sufficiently large. For an eight bit machine, an
accuracy of 0.996 can be achieved within three iterations
(k = 3). Thus the equation is approximated as follows:

    Q ≈ N x (1+δ) x (1+δ^2) x ... x (1+δ^(2^(k-1)))

A table is given below tabulating the convergence sequence
for the maximum value of δ = 0.5.

    Iteration # (k)    δ^(2^k)          1 - δ^(2^k)
    1                  0.25             0.75
    2                  0.0625           0.9375
    3                  0.003906         0.996094
    4                  1.526 x 10^-5    0.9999847
    5                  2.32 x 10^-10    0.9999999
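The iteration is short to express in software. The sketch below assumes a normalized divisor (0.5 <= D < 1) and a caller-chosen iteration count; `converge_divide` is my own name for it, not part of the design.

```python
# Convergence division: multiply the running numerator by (1 + delta^(2^k))
# each iteration, and square delta to obtain the next power. The implicit
# denominator factor (1 - delta^(2^k)) tends to 1, so the result tends to N/D.

def converge_divide(n, d, iterations=5):
    delta = 1.0 - d   # 0 < delta <= 0.5 for normalized d
    q = n
    for _ in range(iterations):
        q *= 1.0 + delta
        delta *= delta
    return q

print(abs(converge_divide(0.75, 0.6) - 0.75 / 0.6) < 1e-6)  # -> True
```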

There will be an overflow if N > D, and this is taken care
of as follows. Both N and D are bounded by the limits 0.5
and 1, where 0.5 is the lower bound and 1 the upper bound;
N and D can assume the value 0.5 but cannot assume 1. Hence
if N > D then N/D can be represented as '1 + fraction',
wherein the fraction is less than 1.

Mathematically we have

    Q = N/D = (D + B)/D = 1 + (B/D)

where B = N - D. The operation of B/D is carried out by the
convergence method. The total result can be obtained by
initialising an overflow bit. This overflow bit must be
taken into consideration when the result of this operation
is required for further operations.
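The N > D adjustment amounts to peeling off an integer one before dividing; `split_overflow` below is a hypothetical helper illustrating it.

```python
# When N > D, write N/D = (D + B)/D = 1 + B/D with B = N - D. Since both
# operands lie in [0.5, 1), B < 0.5 <= D, so the fractional divide always
# sees a numerator smaller than its denominator; the 1 is the overflow bit.

def split_overflow(n, d):
    return (1, n - d) if n > d else (0, n)

bit, b = split_overflow(0.9, 0.6)
print(abs((bit + b / 0.6) - 0.9 / 0.6) < 1e-12)  # -> True
```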
3.3.5  MODIFICATIONS DUE TO DIVISION:

The implementation of division by the convergence method
using the Wallace tree is carried out by splitting the
process into iterations. Each iteration computes the new
partial product, and the number of iterations depends on the
convergence factor delta. The process is explained
mathematically below.

    Let P2 = P1 x (1 + δ^2)        (3.8)

    Let P3 = P2 x (1 + δ^4)        (3.9)

The value of P3 is then substituted in equation (3.6).

It is easy to see that the pipeline has to be modified to
achieve the division operation. From the discussion above,
the partial product Pk is calculated by placing P(k-1) and
(1 + δ^(2^(k-1))) as the two input arguments to the
pipeline. In the first iteration, the convergence factor is
easily calculated, but in the following iterations the power
of delta rises as 2^k. Hence the higher power of delta for
the next iteration is calculated by multiplying the present
value of delta with itself. The purpose of calculating delta
is twofold: 1) to provide the second argument for the next
iteration at the top of the pipe, and 2) to find out whether
convergence has been achieved. The calculation of the second
argument for the next iteration involves delta being
calculated twice at stage six in consecutive clock cycles.
The second delta is used for testing the convergence and
also to realize the next higher power of delta.
The process of calculating the value of delta twice in stage
six is achieved by placing a latch in stage 5 and holding
the value of delta for an extra clock cycle. Hence a new
latch is added to the pipeline at stage five. There is
another change that has to be carried out to ensure the
successful operation of division. The partial product that
comes out of stage six cannot be fed back into the pipe for
the next iteration because the second argument is not yet
available. Hence another stage is added to hold the value of
the partial product until the second argument becomes
available. The changes made are shown in Fig. 3.8. Thus the
pipe is converted from a unifunction pipe to a multifunction
pipe capable of dynamic behavior, as shown in Fig. 3.9. The
dynamic operation depends on both the hardware and the
control schemes for its successful operation. The control is
based on the hardware, and the details of how the control
was realized are explained in the next section.
3.4  DYNAMIC EXECUTION OF INSTRUCTIONS:

The dynamic scheduling of data in a multifunction


pipeline is essential for the successful operation of the
pipeline. The scheduling algorithm maintains collision free
execution of instructions in the pipeline system. The
development of such procedures has been studied by several
researchers. Ramamoorthy and Li [15] have shown that the
general problem is intrinsically difficult and is a member
of the NP complete class of problems. It is conjectured that
any such problem of this class has no fast solution, that
is, a scheduling algorithm is not a polynomial function of
the number of items to be scheduled. Ramamoorthy and Li
[16], Ramamoorthy and Kim [10], and Sze and Tou [17] have
studied suboptimal scheduling algorithms and their
characteristics with mixed results. The work performed by
Davidson [18], Shar [19], Patel and Davidson [20], and
Thomas and Davidson [21] is taken as the foundation, and the
scheduling of the dynamic pipe is developed from it.

Fig. 3.8.  Changes to the pipeline due to division.

Fig. 3.9  Multifunction eight bit arithmetic pipeline unit.
3.4.1  COLLISION VECTORS:

Collision vectors are binary vectors that are derived from
the latencies of a given pipeline. Latency is defined as the
number of pipeline cycles between two successive initiations
of instructions in the pipeline. Initiation is defined as
the process wherein an instruction is fed into the input
stage of the pipeline system. An initiation corresponds to
the start of the computation of a single function. The
latency is an integer and is bounded theoretically from 0 to
infinity. Static pipelines are forbidden to have a latency
of 0, as two simultaneous initiations cannot be performed.
The latency can also be
derived from reservation tables. The reservation table is
a two dimensional array representing the usage of the
pipeline stages by the function. It represents the flow of
the instruction from the first stage to the last stage.
Every function has its unique reservation table. The latency
of a given function is determined by using two reservation
tables of the same function. The first reservation table is
shifted one clock cycle to the left and placed on the second
reservation table. If there are any common stages between
the two tables, there is a chance of collision and that time
cycle is a forbidden latency. The shifting and overlaying
is carried out for all the pipeline cycles that the
instruction remains in the pipeline. A static pipeline is
capable of executing only a single function.
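The shift-and-overlay procedure can be sketched directly. In the sketch a reservation table is a list of sets, one set of busy clock cycles per stage; the sample table and function names are hypothetical, not the tables of this design.

```python
# Forbidden latencies are the cycle differences at which some stage of a
# shifted copy of the reservation table lands on a busy cycle of the
# original; the collision vector marks them (and latency 0) with 1s.

def forbidden_latencies(table):
    return sorted({t2 - t1 for row in table
                   for t1 in row for t2 in row if t2 > t1})

def collision_vector(table):
    span = max(t for row in table for t in row) + 1
    forb = set(forbidden_latencies(table))
    return [1 if (lat == 0 or lat in forb) else 0 for lat in range(span)]

sample = [{0, 3}, {1}, {2}]         # stage 0 busy at cycles 0 and 3, etc.
print(forbidden_latencies(sample))  # -> [3]
print(collision_vector(sample))     # -> [1, 0, 0, 1]
```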


The collision vectors are unique for any given function. The
cross collision vectors depict the times of all possible
initiations within a time period. They are used to schedule
the instruction flow in any pipeline. They are derived as
follows:

The length of the collision vector of a function is equal to
the difference between the time it is initiated and the time
the final result is obtained. If a function is present in
the pipeline for 10 pipeline cycles, the collision vector
would be 10 bits long.
The elements of the binary vector are assigned according to
the latency sequence. All available latencies are assigned
as 0s and all forbidden latencies are assigned as 1s. The
latency sequence is derived by initiating the function into
the pipeline and calculating all the available latencies for
initiating the same function. The available latencies are
bounded by the maximum time that the initiated function
remains in the pipeline. If the function remains in the
pipeline for 10 pipeline cycles, the latency sequence can
have at most 10 elements. If the latency sequence of a
pipeline is (0,1,4,6,8) out of a possible 10 clock cycles,
then the collision vector will be ( 0 0 1 1 0 1 0 1 0 1 ).

Collision vectors are used to derive the cross collision
matrices for the dynamic pipeline. The collision vector for
the reservation table in Fig. 3.10 is listed below:
    c1 = ( 1 1 1 1 0 0 1 0 )

Fig. 3.10  A sample reservation table

Based on the above concepts, the scheduling of the


pipeline is derived in the next sections.
3.4.2  DESIGN OF CROSS COLLISION MATRICES:

The scheduling of instructions in a dynamic system cannot be
performed by a single collision vector. This is due to the
fact that more than one function is executed in the pipeline
and a single vector cannot represent all the available
latencies of all the functions. The scheduling of data in a
dynamic pipeline is carried out by using a matrix instead of
a vector. The matrix is called the cross collision matrix.
The design of the cross collision matrix is based on four
collision vectors. These collision vectors belong to the
four arithmetic operations: addition, subtraction,
multiplication, and division. The collision vector for the
individual functions is derived on the basis of the
reservation table associated with each function. The
collision vectors for the four functions are derived below:
The operation of addition is very similar to that of
subtraction. The only difference is that in subtraction one
of the operands is inverted and the carry in is equal to 1.
All this can be achieved in one clock cycle and therefore no
additional stage is required. The reservation table for
addition and subtraction is shown in Fig. 3.11. The
collision vector for the function of addition and
subtraction is as follows:

    Cadd = { 1 0 0 0 0 0 0 0 }        (3.11)

    Csub = { 1 0 0 0 0 0 0 0 }        (3.12)

The multiplication operation is initiated at the top of the
pipeline. The function once initiated passes through the
various stages to the end of the pipeline. The reservation
table for the operation of multiplication is given in Fig.
3.12. From the reservation table it is found that the
latency is 1 and that the collision vector closely resembles
that of addition and subtraction. The collision vector for
multiplication is as follows:

    Cmlt = { 1 0 0 0 0 0 0 0 }        (3.13)

The collision vector for division involves two sets of
computation. The first set is the generation of partial
products and the second set is the generation of the new
convergence factor. This is due to the fact that, from
equation (3.5), we have in the convergence method two
products to be calculated for each iteration. Recalling the
convergence equation (3.3), we see the need to calculate the
new quotient and the convergence factor for each iteration.

The reservation table for this function is the combination
of equations (3.4) and (3.7). First, the partial quotient is
calculated by introducing one of the arguments, N, as one of
the inputs; the second input is (1 + δ). This gives the new
partial quotient, namely N x (1 + δ). This process is
illustrated in Fig. 3.13. To calculate δ^2 for the next
iteration, the value of δ is initiated as the two inputs

Fig. 3.11  Reservation table for addition and subtraction.

Fig. 3.12  Reservation table for multiplication.
which immediately follows the initiation of the previous
function. This is carried out to have all the operands
available for the next iteration without any undue delay.
The value of delta is held in stage five for two consecutive
clock cycles because of the necessity of obtaining the new
values of (1 + δ^(2^k)) and (δ^(2^k))^2. This is illustrated
in Fig. 3.14. In the reservation table the flow of the
partial


products is marked with an X and that of delta is marked
with an O. Even though they are two different subfunctions
of the same main function, they are combined together into
one reservation table. The reservation table for division is
shown in Fig. 3.15. The collision vector for division is
given below:

    Cdiv = { 1 1 1 0 0 0 0 0 }        (3.16)

DESIGN OF THE CROSS COLLISION MATRIX:

A cross collision matrix is an r x d binary matrix, where r
is the number of reservation tables and d is the maximum of
the clock cycles of all the tables. In our design the value
of r is 3 and the value of d is 8. The cross collision
matrix represents a state of operation of the pipeline. The
steps for designing the initial cross collision matrices are
as follows [23]:
Step 1:  There are r initial states for the r reservation
tables. The table i which assumes the first initiation at
clock cycle 0 is of the type i.

Step 2:  The jth row of the ith matrix CMi is the collision

Fig. 3.13  Reservation table for the partial product of N(1 + δ).

Fig. 3.14  Reservation table for delta products (δ^2).

Fig. 3.15  The reservation table for the convergence method.
vector between an initiation of type i and a later
initiation of type j. Thus CMi(j,k) is 0 only if shifting
the reservation table j, k places to the right and
overlaying it on a copy of the reservation table i results
in no collision. Here k denotes the number of clock cycles
from the initial clock cycle 0 at which an initiation of the
function j is desired.

Step 3:  In all cases the ith row of CMi is the same as the
initial collision vector of the function i. It is equivalent
to the reservation table i used in a static configuration.
The other rows are called cross collision vectors.
Fig. 3.16 shows a sample initial collision matrix for an
operation i. The number of rows depends on the total number
of distinct operations that can be performed by the pipeline
system. In our research the pipeline can perform three
distinct operations and hence the rows are three in number.
Let the sample matrix represent the initial collision matrix
of the divide operation, which is tagged as operation number
1. Row 1 is the initial collision vector of the division
operation. Row 2 is the cross collision vector between
operation 1 and operation 2. Row 3 is the cross collision
vector between operation 1 and operation 3.

The initial collision matrix for the division operation is
derived from the collision vectors of addition, subtraction,
multiplication and division. The formulation

Fig. 3.16  Structure of an initial cross collision matrix
for operation i. Row j (for j = 1 to n) is the cross
collision vector between operation j and operation i; row i
itself is the collision vector for operation i. The number
of rows equals the total number of operations n that can be
performed by the system, and the number of columns equals
the maximum compute time of operations 1 to n.
of the initial collision matrix for division is chosen to
illustrate the process. The operations are tagged as 1, 2,
3, and so on. Each operation has its own initial

collision matrix. In our research the operation of division
is tagged as 1, the operation of multiplication is tagged as
2, and the operations of addition and subtraction are tagged
as 3. The assignment of the tags is of no consequence and
they can be assigned as desired. Care should be taken to
assign the same numbers to the initial collision matrices
associated with each operation. In this case, the collision
matrix 2 represents the initial collision matrix for
multiplication. The row 1 of matrix 1 should be the
collision vector of operation 1, and the row 2 of matrix 2
should be the collision vector of operation 2. This is the
same case with all the initial cross collision matrices.
The row 1 of matrix 1 will be the collision vector of the
division operation. The elements of row 1 are
{ 1 1 1 0 0 0 0 0 }. The row 2 represents the cross
collision vector between multiplication and division. The
elements of this row are obtained by sliding the reservation
table of multiplication in Fig. 3.12 to the right over the
reservation table of division in Fig. 3.14 and determining
the available latency sequence. The available latency
sequence between division and multiplication is
{ 3, 4, 5, 6, 7 } and hence the elements of row 2 are
{ 1 1 1 0 0 0 0 0 }.

Using the same process the cross collision vector between
addition and division is obtained, and the elements in row 3
are { 0 0 0 0 0 1 1 1 }. The resulting matrix is given in
Fig. 3.17.
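The sliding-and-overlay derivation of a cross collision vector can be sketched as below; tables are dicts mapping stage number to the set of busy cycles, and both sample tables are hypothetical, not the tables of this design.

```python
# Entry k of the cross collision vector is 1 iff starting function j,
# k cycles after function i, makes some stage busy in the same cycle for
# both: slide table j right by k and test for overlap with table i.

def cross_collision_vector(table_i, table_j, d):
    vec = []
    for k in range(d):
        clash = any(table_i.get(stage, set()) & {t + k for t in cycles}
                    for stage, cycles in table_j.items())
        vec.append(1 if clash else 0)
    return vec

ti = {2: {2, 3}}  # function i uses stage 2 at cycles 2 and 3
tj = {2: {0}}     # function j uses stage 2 at cycle 0
print(cross_collision_vector(ti, tj, 6))  # -> [0, 0, 1, 1, 0, 0]
```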
In the proposed pipeline system, there are three distinct
functions which produce three initial collision matrices,
and they are presented in Fig. 3.18. The state diagrams are
generated using the initial collision matrices.

3.4.3  GENERATION OF STATE DIAGRAMS:

The state diagrams represent the condition of stage
utilization of the pipeline at any instant of time. The
state diagram gives the controllers useful information about
the state of the system. At each pipeline cycle, the
pipeline configuration corresponds to one of the states. The
generation of state diagrams follows the steps given below:
Step 1:  Each initial cross collision matrix is a single
state. The initiations are controlled by the elements of
individual columns.
Step 2:  The next state is determined by looking at the
column 0 of the present collision matrix. For every 0 in the
column there can be an initiation. The function that can be
initiated depends on the row where the 0 occurs. If a 0 is
present in row i then an initiation for function i is
possible. This means that the new initiation of function i
will not collide with any of the previous initiations.

Fig. 3.17  Initial cross collision matrix for division.

Fig. 3.18  Initial cross collision matrices for the three
operations: division, multiplication, and addition or
subtraction.
However, this does not guarantee that it will not collide
with any other initiations that may be possible at the same
time. For each initiation there will be a new state. The new
state is determined by ORing the present collision matrix
with the initial collision matrix corresponding to the
function i.
Step 3:  The compatible initiation set is determined as
follows. The compatible initiation set is basically the set
of functions that can be started at the same time without
any collisions. This is equivalent to placing the associated
reservation tables one on top of the other, forming a
composite overlay, and ensuring that there are no matches.
Step 4:  For a single initiation, the generation of a new
state is as explained in step 2. If the column 0 contains
more than one 0, then multiple initiations of functions are
possible. When multiple initiations are required, the
functions are first checked to see whether they belong to a
compatible initiation set. If the functions are compatible,
the new state is generated by ORing the present state matrix
with the combined collision matrix. The combined collision
matrix is derived by ORing all the individual initial
collision matrices representing the functions that are to be
currently initiated.
Step 5:  If no initiation is possible in the present cycle,
the collision matrix is shifted one place to the left and
zeros are introduced from the right. The pipeline remains in
the present state. The column 1 now becomes the column 0 and
steps 1 to 5 are again followed.
Step 6:  All the new states from the present state matrix
have to be generated. Considering the column 0, steps 1 to 4
are carried out for all permissible initiations. If no
initiations are possible then step 5 is adopted. After all
the new states have been derived from column 0, the state
matrix is shifted one column to the left as in step 5. This
process is carried on until all the columns in the present
state matrix have been processed. At the end, the present
state matrix will be a zero matrix.
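Steps 1 to 6 reduce to two matrix operations, which the following sketch illustrates on tiny hypothetical 3x3 matrices (the actual design uses 3x8 matrices; all names here are mine).

```python
# A state is a binary matrix. An idle cycle shifts every row left with a
# zero entering on the right (step 5); an initiation ORs the present state
# with the initial matrix (or combined matrix) of the started functions.

def shift_left(state):
    return [row[1:] + [0] for row in state]

def initiate(state, initial_matrices, funcs):
    new = [row[:] for row in state]
    for f in funcs:
        for r, row in enumerate(initial_matrices[f]):
            for c, bit in enumerate(row):
                new[r][c] |= bit
    return new

CM_add = [[0, 0, 0], [0, 0, 0], [1, 0, 0]]  # hypothetical initial matrices
CM_mul = [[0, 0, 0], [1, 0, 0], [0, 0, 0]]
state = shift_left(CM_mul)                  # one idle cycle after a multiply
state = initiate(state, {"add": CM_add}, ["add"])
print(state)  # -> [[0, 0, 0], [0, 0, 0], [1, 0, 0]]
```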
Steps 1 to 6 are carried out for all the state matrices that
have been generated. This process is stopped when, for every
possible initiation from the other states, the resulting
state already exists. The states in the diagram are linked
to each other by arcs. These arcs are labelled; the
labelling represents the function initiated and the latency.
All the states have to be derived so as to enable the system
to move from one state to another after each initiation.
The compatible initiation set is defined as a set of
functions that can be initiated as a single function or as
multiple functions, and cause no collisions between
themselves. The compatible initiation sets for the pipeline
under consideration are as follows:

    I1 = { Addition }
    I2 = { Subtraction }
    I3 = { Multiplication }
    I4 = { Division }
    I5 = { Multiplication, Addition }
    I6 = { Multiplication, Subtraction }
    I7 = { Division, Addition }
    I8 = { Division, Subtraction }

The functions are tagged with the following number
representations:

    Addition and Subtraction => 1.
    Multiplication => 2.
    Division => 3.

In the state matrices, the row 1 corresponds to the division
operation, the row 2 corresponds to multiplication, and the
row 3 corresponds to addition or subtraction.

A 0 in row 1 implies that an initiation is possible for the
division operation. Similarly, a 0 in row 2 implies that an
initiation of multiplication is possible. If there are two
0s in a column, in row 1 and row 3, then both division and
addition (or subtraction) are possible at the same time.

Assuming the initial collision matrix for the division
operation as the current state, the new states are derived
using the steps described above. This example will show how
the various new states are developed. The state matrix is
presented below in Fig. 3.19:

Fig. 3.19  Initial state matrix, which is the initial cross
collision matrix for the division operation

The allowable latencies for each of the rows are listed
below in Fig. 3.20. The maximum table compute time is 8
clock cycles.

Fig. 3.20  The allowable latencies for the state matrix.

Looking at the columns, the compatible initiation set for
each column is shown in Fig. 3.21:

    Column 0: Initiation set = { 1 }
    Column 1: Initiation set = { 1 }
    Column 2: Initiation set = { 1 }
    Column 3: Initiation set = { 1, 2, 3, {1,2}, {1,3} }
    Column 4: Initiation set = { 1, 2, 3, {1,2}, {1,3} }
    Column 5: Initiation set = { 2 }
    Column 6: Initiation set = { 2 }
    Column 7: Initiation set = { 2 }

Fig. 3.21  The available initiation sets

For latency 0, the only allowable initiation is addition or subtraction. For latency 3, all the compatible operations are possible. The initiation illustrated in this example for that latency is { 1, 2 }. Hence this example covers both a single initiation and a multiple initiation. Listed in Fig. 3.22 are the initial collision matrices of addition and multiplication, respectively.
Latency 0: The state matrix is not shifted to the left, as the initiation occurs at latency 0. The new state is derived by ORing the present collision matrix with the initial collision matrix of addition. The resulting collision matrix is the new state after the initiation of addition. The operation and the result are listed in Fig. 3.23.

Fig. 3.22  Initial collision matrices for single initiation: the initial collision matrix for multiplication and the initial collision matrix for addition and subtraction

Fig. 3.23  Generation of the new state matrix for single initiation: the initial state matrix is ORed with the initial collision matrix for addition and subtraction to yield the new state matrix

Fig. 3.24  The new state matrix obtained by shifting left three times

Fig. 3.25  Process of deriving the combined initial cross collision matrix for dual initiation: the initial cross collision matrices for multiplication and for addition are combined

Fig. 3.26  The new state matrix for combined initiation of distinct functions: the initial state matrix is ORed with the combined cross collision matrix to yield the new state matrix
The process of generating the next state for the double initiation is no different from that of the single initiation. The present state matrix is shifted to the left by three columns for a latency of three, and zeros are introduced at the right. The shifted state matrix is shown in Fig. 3.24. The two initial collision matrices for addition and multiplication are ORed together to generate the combined initial collision matrix for the double initiation. The new state matrix is derived by ORing the current state matrix with the combined initial collision matrix. The operation and results are shown in Fig. 3.25 and Fig. 3.26.
The remaining new states are constructed in this
manner.
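The derivation procedure above — shift the current state matrix left by the chosen latency (zeros entering at the right), then OR in the initial collision matrix of the initiated function(s) — can be sketched in C, the language of the simulation program in Chapter 5. This is an illustrative model, not the thesis code; the 3x8 matrix shape follows the state matrices of this section.

```c
#define ROWS 3   /* row 1: division, row 2: multiplication, row 3: add/sub */
#define COLS 8   /* the maximum table compute time is 8 clock cycles       */

/* Derive the next state matrix in place: shift the current state left by
 * `latency` columns (zeros enter on the right), then OR in the initial
 * collision matrix of the function(s) being initiated. */
void next_state(int state[ROWS][COLS], int init[ROWS][COLS], int latency)
{
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++) {
            int shifted = (c + latency < COLS) ? state[r][c + latency] : 0;
            state[r][c] = shifted | init[r][c];
        }
}

/* A function may be initiated at a given latency only if its row of the
 * state matrix holds a 0 at that column. */
int latency_available(int state[ROWS][COLS], int row, int latency)
{
    return state[row][latency] == 0;
}
```

For a dual initiation, the `init` argument would be the combined initial collision matrix, i.e. the OR of the two functions' initial matrices, as in Fig. 3.25.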

It should be noted that if more than one initiation is possible in a column, then it is necessary to derive new states for each of them. All the possible states from the initial state matrix are illustrated in Fig. 3.27. The transitions between states are marked by arrows. Each arrow is labelled on the top by the latency and at the bottom by the initiation set.

Legend for Fig. 3.27 — 1: addition or subtraction operation; 2: multiplication operation; 3: division operation. A plain number on an arrow is the latency; [ ] encloses the operation initiated.

Fig. 3.27  The possible states from the initial collision matrix for division

CHAPTER FOUR

INSTRUCTION EXECUTION IN THE PIPELINE SYSTEM

The execution of instructions in the pipeline is governed by the issue unit and the controllers in the execution unit. The instructions are scheduled by resolving the RAW, WAW, and operational hazards. The general procedure for the execution of instructions in our proposed system is recapped below:
An instruction is fetched from memory during every pipeline cycle by the fetch unit. The fetch unit classifies the instruction and generates an effective address (EA) if needed. The EA is loaded into the appropriate counter. Two instructions are fetched from memory when both streams are active. The fetch unit feeds two queues in the decode unit. An individual stream is disabled once the corresponding queue is filled.
The decode unit decodes the instruction and places the information in the R field of the system status unit. The decode unit is disabled if a jump instruction is awaiting evaluation in the logic unit.
The function of the issue unit is to detect the hazards and resolve them according to the algorithm developed in Chapter 2. The issue unit is disabled if any jump instruction is being evaluated in the logic unit. The issue unit assigns a delay to each instruction and routes it to the appropriate execution unit.
The instructions that are to be delayed are stored in buffers provided at the input stages. The controller of the execution unit is responsible for resolving the operational hazards. The controller also checks the available latencies to accommodate instructions that are ready to be executed in the present pipeline cycle. The controller for the arithmetic unit provides feedback to the instruction status unit. This feedback takes the form of updating the counter fields, depending on the state of the system. The controllers are also responsible for loading the destination register with the result of an instruction as soon as it is out of the pipe. A sample set of instructions is listed below. The operation of the pipeline is illustrated by executing this sample set: the flow of instructions through the pipeline is displayed during each pipeline cycle until all the instructions have been executed.
Consider the following set of instructions:

load  r1, 20;
load  r2, 30;
load  r3, 40;
add   r4, r1, r2;
store k, r4;
add   r4, r2, r3;
mult  r5, r2, r3;
jnz   r5, 60;
....
store

The execution of each instruction is displayed in the following figures, and a brief explanation is provided for each cycle. The issue unit schedules an instruction one pipeline cycle later than the decode unit, and hence a column is provided in the instruction status unit that shows the current instruction being issued. The fields are captioned in the diagrams for easy identification.
Pipeline cycle # 0:

The instruction 'load r1, 20;' is fetched by the fetch unit. The instruction is not a jump instruction, and hence the counters of counter set 2 are left undisturbed. The current stream is still the PIC stream. This is shown in Fig. 4.1.
Pipeline cycle # 1:

The second instruction is fetched from memory. The second instruction, 'load r2, 30;', is also not a jump instruction. The first load instruction is forwarded to the PIC queue, where it is initially loaded into the bottom location; the bottom location is directly connected to the decoder unit. As a result, the first instruction is decoded and the decoded information is placed in the instruction status unit. Figs. 4.2 and 4.3 illustrate the presence of the instruction in stages 1 and 2.

Fig. 4.1  The state of the system at pipeline cycle # 0 ('load r1, 20;' in the fetch unit)

Fig. 4.2  The state of the system at pipeline cycle # 1 ('load r2, 30;' in the fetch unit)

Op-code: the opcode of the instruction.
Time: the time required to execute the instruction.
R field: the field of all registers in the system.
C field: the field of the counters that keep track of the registers.

Fig. 4.3  State of the instruction status unit during pipeline cycle # 1


Pipeline cycle # 2:
The first instruction is in the issue unit. The content of the counter c1, which represents the register r1, is zero. This implies that there is no RAW or WAW hazard. The Tinst-delay is calculated as follows:

Csink(old) = 0
Ttest = 0

From equation (2.14), Tinst-delay = 0. According to equation (2.17), Csink(new) = 6.
The issue unit issues instruction 1 to the logic unit without assigning any delay. Instruction 2 is in the decode unit and instruction 3 is in the fetch unit. Instruction 1 will load the register r1 with the new value after it has been executed by the logic unit. The total time that the load operation needs to execute is 6 pipeline cycles, and hence the counter c1 is set to 6. The current value of c1 denotes the number of pipeline cycles needed (with respect to the present pipeline cycle) for r1 to be loaded with 20. Figs. 4.4 and 4.5 illustrate the state of the system.
Pipeline cycle # 3:
The instruction 1 is in the logic unit. Instruction 2
is in the issue unit. Instruction 3 is in the decode unit
and instruction 4 is in the fetch unit. Instruction 4 is an
arithmetic instruction. The counters of the EAC stream are

Fig. 4.4  The state of the system at pipeline cycle # 2 ('load r3, 40;' in the fetch unit; no RAW or WAW hazard for instr # 1; instr # 1 issued for execution and routed to the logic unit)

Fig. 4.5  State of the instruction status unit during pipeline cycle # 2


still not operational. Instruction 2 is checked for hazards; it has none and is issued to the logic unit, with the counter c2 initialized to a value of 6. c1 is decremented, as it represents one less pipeline cycle for r1 to be loaded with the new value. The delay is calculated as in pipeline cycle # 2. These are illustrated in Figs. 4.6 and 4.7. The counter values in the instruction status unit are:

c1 = 5, c2 = 6
Pipeline cycle # 4:

This is similar to pipeline cycles 2 and 3. Instruction 3 is in the issue unit. The initial value of the counter c3 is 0, so no delay is needed. The instruction is issued to the logic unit. Figs. 4.8 and 4.9 illustrate the data flow for this cycle. The instruction in the fetch unit is not a branch instruction.
Pipeline cycle # 5:
The fourth instruction is in the issue unit. A RAW hazard is detected, as the previous instructions (# 1 and # 2) have not yet loaded the source registers with their new values. From the system status unit, c1 = 3 and c2 = 4. There is no WAW hazard, as c4 = 0. The delays are calculated as follows:

Csource-reg1 = 3
Csource-reg2 = 4
Csink(old) = 0

Fig. 4.6  The state of the system at pipeline cycle # 3 ('add r4, r1, r2;' in the fetch unit; no RAW or WAW hazard for instr # 2; instr # 2 issued for execution and routed to the logic unit)

Fig. 4.7  State of the instruction status unit during pipeline cycle # 3

Fig. 4.8  The state of the system at pipeline cycle # 4 ('store (k), r4;' in the fetch unit; no RAW or WAW hazard for instr # 3; instr # 3 issued for execution and routed to the logic unit)

Fig. 4.9  State of the instruction status unit during pipeline cycle # 4

From equation (2.20), Tsrc-delay = 5. From equation (2.18), Ttest = 5 + 3 - 1 = 7. From equation (2.29), since Ttest > Csink(old), the instruction delay is Tinst-delay = Tsrc-delay = 5, and the new sink counter value is Csink(new) = T + (Tsrc-delay - 1) = 3 + 4 = 7.

The instruction is issued to the DS in stage 6 of the arithmetic unit. The counter c4 is initialized to 7; the register r4 will be loaded with the new value after a period of 7 pipeline cycles. The instruction in the fetch unit is not a branch instruction. The process is illustrated in Figs. 4.10 and 4.11.
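The delay arithmetic of cycles # 2, # 5 and (below) # 7 can be collected into a single sketch. This is a reconstruction of one plausible reading of equations (2.14)-(2.30) from Chapter 2, which is not reproduced here; the type and function names are illustrative, but the numbers reproduce the worked cycles.

```c
#define NREGS 8   /* illustrative register-file size */

typedef struct {
    int src1, src2;   /* source register numbers       */
    int sink;         /* destination register number   */
    int exec_time;    /* T: cycles the operation needs */
} instr_t;

/* Sketch of the issue-unit delay computation. c[] holds one counter per
 * register: the cycles remaining until that register has its new value. */
int issue_delay(const instr_t *in, int c[NREGS])
{
    int cmax = c[in->src1] > c[in->src2] ? c[in->src1] : c[in->src2];
    int t_src_delay = (cmax > 0) ? cmax + 1 : 0;       /* cf. eq. (2.20) */
    int t_inst_delay = t_src_delay;                    /* RAW-only case  */

    /* WAW check: Ttest = Tsrc-delay + T - 1 (cf. eq. (2.18)); when the
     * old write to the sink finishes close behind the new one, an extra
     * cycle of delay is added (cf. eq. (2.27)). */
    int t_test = t_src_delay + in->exec_time - 1;
    if (c[in->sink] > 0 && t_test - c[in->sink] <= 2)
        t_inst_delay = t_src_delay + 1;

    /* Re-initialize the sink counter so it reaches zero exactly when the
     * new value lands in the register (cf. eqs. (2.17) and (2.30)). */
    c[in->sink] = (t_inst_delay > 0)
                ? in->exec_time + t_inst_delay - 1
                : in->exec_time;
    return t_inst_delay;
}
```

With c1 = 3, c2 = 4 and an add execution time of 3, this yields a delay of 5 and a sink counter of 7, matching cycle # 5 above.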
Pipeline cycle # 6:
The present instruction in the issue unit is the first store instruction. The execution of this instruction has to be delayed by 7 pipeline cycles; the delay is computed as shown in the previous cycle. The instruction will be held in the DS at the LU until the instruction delay counter counts down to zero. The state of the pipeline is shown in Figs. 4.12, 4.13, 4.14 and 4.15.
Pipeline cycle # 7:
The second add instruction is in the issue unit. The sink register is r4. The previous write to r4 has not yet completed; the counter c4 from the instruction status unit equals 5 pipeline cycles. This denotes that

Fig. 4.10  The state of the system at pipeline cycle # 5 ('add r4, r2, r3;' in the fetch unit; RAW hazard for instr # 4, no WAW hazard; instruction delay set equal to 5; instruction routed to the fixed point unit)

Fig. 4.11  State of the instruction status unit during pipeline cycle # 5

Fig. 4.12  The state of the system at pipeline cycle # 6 (RAW hazard for instr # 5, no WAW hazard; instruction routed to the logic unit)

Fig. 4.13  State of the instruction status unit during pipeline cycle # 6

Each unit is a delay buffer.

Pr #: priority number attached to each unit
ASR1: address of source register 1    ASR2: address of source register 2
DSR1: delay of source register 1      DSR2: delay of source register 2
SD1: source data 1                    SD2: source data 2
ID: instruction delay                 DR: destination resource

Fig. 4.14  State of the delay station in stage 1 of the AU during pipeline cycle # 6

Fig. 4.15  State of the delay station in stage 6 of the AU during pipeline cycle # 6


the previous instruction initializing r4 will not complete execution for another 5 cycles. The present instruction need not be delayed by that full amount; the instruction delay is computed as shown below.
Csource-reg1 = 2
Csource-reg2 = 3
Csink(old) = 5

From equation (2.20), Tsrc-delay = 4. From equation (2.18), Ttest = 3 + 4 - 1 = 6. Here Ttest > Csink(old) and (Ttest - Csink(old)) = 1, so using equation (2.27), Tinst-delay = (Tsrc-delay + 1) = 4 + 1 = 5. Using equation (2.30), Csink(new) = 7.

The new value will be loaded into the register after 7 cycles, and the counter c4 is re-initialized to 7. The instructions that use the previous value of r4 as a source operand are all referenced to the time when that old value (the result of the previous add instruction) is loaded into r4. Thus, as soon as the old value is loaded into r4, only the buffers that need it will capture it. Once the data is captured, a buffer will not reload until it is reset.
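This capture-once behavior of a delay-buffer operand slot can be sketched as follows. The field names mirror the ASR/DSR/SD captions of Fig. 4.14; the function itself is an illustrative model, not the simulator's code.

```c
typedef struct {
    int asr;        /* ASR: address of the watched source register     */
    int dsr;        /* DSR: cycles until the needed value is broadcast */
    int sd;         /* SD: the captured source data                    */
    int captured;   /* once set, the slot holds its value until reset  */
} operand_slot;

/* Called once per pipeline cycle with the register currently being
 * written back (-1 if none) and its value. The slot latches the
 * broadcast exactly once, when its delay counter has reached zero. */
void slot_tick(operand_slot *s, int written_reg, int written_val)
{
    if (s->captured)
        return;                      /* do not reload until reset */
    if (s->dsr > 0)
        s->dsr--;                    /* still counting down       */
    if (s->dsr == 0 && written_reg == s->asr) {
        s->sd = written_val;         /* capture the broadcast     */
        s->captured = 1;
    }
}
```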


The instruction in the fetch unit is a jump instruction based on the result of r5. If the branch is taken, the destination is the instruction with the label # 60. The destination address is loaded into counter 0 of the

Fig. 4.16  The state of the system at pipeline cycle # 7 ('jnz r5, 60;' in the fetch unit; RAW and WAW hazards for instr # 6; instruction delay set equal to 5; instruction routed to the fixed point unit)

Fig. 4.17  State of the instruction status unit during pipeline cycle # 7

Fig. 4.18  State of the delay station in stage 1 of the AU during pipeline cycle # 7

Fig. 4.19  State of the delay station in stage 6 of the AU during pipeline cycle # 7

Fig. 4.20  State of the delay station in the LU during pipeline cycle # 7


EAC stream. The EAC stream will become operational in the following pipeline cycle. This is illustrated in Figs. 4.16, 4.17, 4.18, 4.19 and 4.20.
Pipeline cycle # 8:

The multiplication instruction is issued to the DS of stage 1 in the AU. A delay of 3 cycles is necessary to resolve the RAW hazard; the delay is calculated as shown in pipeline cycle # 5. The counter c5 is initialized with the value 10. The EAC stream fetches the instructions starting from the label 60, along with the PIC stream. Both fetched instructions are assumed to be non-branch instructions. The jump instruction is in the decode stage. The register r1 is loaded with the value 20. The state of the pipeline is illustrated in Figs. 4.21 to 4.25.
Pipeline cycle # 9:

The jump instruction is in the issue unit. From the instruction status unit, the value of r5 will be available only after 11 cycles; hence the jump instruction can be evaluated only after 11 cycles. The instruction is issued to the DS of the LU with a delay of 12 cycles. The issue unit, along with the decode unit, will be disabled for the next 11 cycles, starting from the next cycle. The instructions present at the bottom of both queues are decoded and the information is placed in the instruction status unit. The register r2 is updated with the value 30. This is illustrated in Figs. 4.26 to 4.30.

Fig. 4.21  The state of the system at pipeline cycle # 8 (RAW hazard for instr # 7, no WAW hazard; instruction delay set equal to 3; instruction routed to the fixed point unit)

Fig. 4.22  State of the instruction status unit during pipeline cycle # 8

Fig. 4.23  State of the delay station in stage 1 of the AU during pipeline cycle # 8

Fig. 4.24  State of the delay station in stage 6 of the AU during pipeline cycle # 8

Fig. 4.25  State of the delay station in the LU during pipeline cycle # 8

Fig. 4.26  The state of the system at pipeline cycle # 9 (RAW hazard for instr # 8; instruction routed to the logic unit)

Fig. 4.27  State of the instruction status unit during pipeline cycle # 9

Fig. 4.28  State of the delay station in stage 1 of the AU during pipeline cycle # 9

Fig. 4.29  State of the delay station in stage 6 of the AU during pipeline cycle # 9

Fig. 4.30  State of the delay station in the LU during pipeline cycle # 9


Pipeline cycle # 10:

The first add instruction is due for execution. The initial cross collision matrix was a null matrix until this cycle. The first instruction to be initiated is the add instruction, so the initial cross collision matrix for addition now becomes the initial state matrix of the pipeline system. The initial state matrix is shown in Fig. 4.31. The add instruction is initiated, as the latency is available. The register r3 is updated with the value 40. The state of the system is illustrated in Figs. 4.32 to 4.35.
Pipeline cycle # 11:
The multiplication instruction is due for execution. The state matrix of the previous cycle is shifted one column to the left. The latency for multiplication is checked by examining row 2 at column 0; the latency is available. The instruction is initiated and the new state matrix is shown in Fig. 4.36. The data flow in the system is illustrated in Figs. 4.37 to 4.40.
Pipeline cycle # 12:
The second add instruction is due for execution. The state matrix of the previous cycle is shifted one column to the left, as illustrated in Fig. 4.41. The latency for the add is available, as the element at row 3, column 0 contains a 0. The instruction is initiated and the new state matrix is obtained as shown in Fig. 4.42. The data flow is illustrated in Figs. 4.43 to 4.46.
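The per-cycle controller behavior of cycles # 10 through # 12 — shift the state matrix one column to the left each cycle, test column 0 of the function's row, and OR in the function's initial collision matrix on a successful initiation — can be sketched as follows (an illustrative C model using the 3x8 matrices of Chapter 3, not the thesis code):

```c
#define ROWS 3
#define COLS 8

/* Each pipeline cycle the state matrix shifts one column to the left;
 * zeros enter on the right. */
void shift_left(int m[ROWS][COLS])
{
    for (int r = 0; r < ROWS; r++) {
        for (int c = 0; c + 1 < COLS; c++)
            m[r][c] = m[r][c + 1];
        m[r][COLS - 1] = 0;
    }
}

/* Try to initiate the function owning `row`: allowed only when column 0
 * of that row is 0, in which case the function's initial collision
 * matrix is ORed into the state. Returns 1 on success, 0 if the
 * instruction must be held for another cycle. */
int try_initiate(int state[ROWS][COLS], int init[ROWS][COLS], int row)
{
    if (state[row][0] != 0)
        return 0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            state[r][c] |= init[r][c];
    return 1;
}
```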

Fig. 4.31  Initial state matrix in cycle # 10

Fig. 4.32  The state of the system at pipeline cycle # 10 (instruction held in the issue unit; decode and issue units disabled)

Fig. 4.33  State of the delay station in stage 1 of the AU during pipeline cycle # 10

Fig. 4.34  State of the delay station in stage 6 of the AU during pipeline cycle # 10

Fig. 4.35  State of the delay station in the LU during pipeline cycle # 10

Fig. 4.36  The new state matrix in cycle # 11

Fig. 4.37  The state of the system at pipeline cycle # 11 (instruction held in the issue unit; decode and issue units disabled)

Fig. 4.38  State of the delay station in stage 1 of the AU during pipeline cycle # 11

Fig. 4.39  State of the delay station in stage 6 of the AU during pipeline cycle # 11

Fig. 4.40  State of the delay station in the LU during pipeline cycle # 11

Fig. 4.41  The shifted state matrix of cycle # 11

Fig. 4.42  The new state matrix of cycle # 12

Fig. 4.43  The state of the system at pipeline cycle # 12 (instruction held in the issue unit; decode and issue units disabled)

Fig. 4.44  State of the delay station in stage 1 of the AU during pipeline cycle # 12
Fig. 4.45  State of the delay station in stage 6 of the AU during pipeline cycle # 12

Fig. 4.46  State of the delay station in the LU during pipeline cycle # 12

Pipeline cycles # 13 to # 20:

The results are updated as they become available, and the branch instruction is evaluated during pipeline cycle # 19. These are illustrated in Figs. 4.47 to 4.54.

Fig. 4.47  The state of the system at pipeline cycle # 13 (instruction held in the issue unit; decode and issue units disabled)

Fig. 4.48  State of the delay station in the LU during pipeline cycle # 13

Fig. 4.49  The state of the system at pipeline cycle # 14 (instruction held in the issue unit; decode and issue units disabled)

Fig. 4.50  State of the delay station in the LU during pipeline cycle # 14

Fig. 4.51  The state of the system at pipeline cycle # 19 (instruction held in the issue unit; decode and issue units disabled)

Fig. 4.52  State of the delay station in the LU during pipeline cycle # 19 (unit 1 holds the jump instruction: ASR1 = R5, DSR1 = 0, SD1 = 1200)

[Fig. 4.53 The state of the system at pipeline cycle # 20: no RAW hazard and no WAW hazard for instruction # 60, no delay is assigned, and the instruction is routed to the logic unit]

[Fig. 4.54 State of the instruction status unit during pipeline cycle # 20. Op-code is the opcode of the instruction; Time is the time required to execute the instruction; the R field holds all registers in the system; the C field holds the counters that keep track of the registers]

CHAPTER FIVE
COMPUTER SIMULATION AND EXPERIMENTAL RESULTS

The operation of the system is simulated on the DEC VAX
11/750 mini-computer. The simulation program is implemented
in two sections. The first section simulates the PIU and the
second simulates the PEU. In real time operation, the
various units are synchronized. The units are termed stages.
The total number of stages in the system is ten. The first
three stages are the fetch unit, decode unit and issue unit
respectively. The remaining seven stages constitute the
pipelined arithmetic unit. The program is written in the C
language. Each stage is simulated by a single function. This
is illustrated in Fig. 5.1.
In actual operation, the stages operate concurrently.
For example, assume that the decode unit receives an
instruction I from the fetch unit at the beginning of
cycle J. It processes the instruction and forwards it to the
issue unit at the end of the cycle. The issue unit,
meanwhile, receives instruction I-1 at the beginning of
cycle J. Instruction I will be received by the issue unit
only at the beginning of cycle J+1. Fig. 5.2 illustrates
the data flow. The stages begin processing at the beginning
of each cycle and complete processing at the end of each
cycle. This implies that the simulation program must begin
the execution of functions at the same time. The

[Fig. 5.1 Structure of the simulation program: Function fetch_unit -> Function decode_unit -> Function issue_unit -> Function stage_one -> Function stage_two -> Function stage_three -> Function stage_four -> Function stage_five -> Function stage_six -> Function stage_seven]

[Fig 5.2 Actual data flow in real time for a pipeline system]


executions must also end at the same time. This is not
possible on a serial machine. The program is executed in
iterations, where each iteration represents one pipeline
cycle. In each iteration, the functions emulating the stages
are executed serially in their physical order. Furthermore,
the input for each function is provided by the preceding
function. The parallelism of executing the functions would
be lost if the output of any function were fed directly to
the input of the next function; this would reduce the
pipeline to a large sequential system. Concurrency is
introduced into the serial program by separating the
processing from the data transfer. Each function reads its
input from a buffer called the input buffer. The result
of the function is loaded into an output buffer. The next
function reads its data from its own input buffer, not
from the output buffer of the previous function. For
example, the function emulating the decode unit reads the
input from the input buffer assigned to this unit and
processes the instruction. The processed instruction is
stored in an output buffer designated to the decode unit.
The next function, emulating the issue unit, reads the
instruction from its input buffer, not from the output
buffer of the decode unit. This isolates the data between
two adjacent functions. The data transfer is carried out
before the beginning of the next iteration. This is shown
in Fig. 5.3. The whole program is executed

[Fig. 5.3 Emulating the concurrency operation in a serial program: in operating mode each function reads its input buffer and writes its output buffer; in data transfer mode only the buffer-to-buffer copies take place and no function is executed]

effectively in two modes: the operating mode and the data
transfer mode. Thus parallelism is obtained by serially
executing the functions and controlling the data transfer.
The program is subdivided into two groups of functions.
The first group is called the T_on group; it represents the
serial execution of the various emulating functions. The
second group is termed the T_off group; it represents the
data transfer functions. The various functions are described
in the following sections.
5.1 FUNCTIONS EMULATING THE STAGES OF THE PIU:

The PIU is emulated by three functions: 1) fetch_unit,
2) decode_unit, and 3) issue_unit. The memory and the input
and output buffers for each function are held in structures.


A structure is a collection of one or more variables,
possibly of different types, grouped together under a single
name for convenient handling. The structures of the buffers
and the memory are of the same type, shown below:
struct input_inst
{
    int opcode_field;
    int source_operand1;
    int source_operand2;
    int dest_operand;
    int valid;
};
The various functions are described below.


Function "fetch_unit":
This function emulates the operation of the fetch unit.
The function is provided with two sets of counters. These


counters are used to keep track of the branch instructions.


The counters are stored as structures and are individually
indexed from 0 to 9. The counter indexed as 0 is the program
counter. The instructions are classified by this function.
If a branch instruction is encountered, the effective
address held in the destination operand is loaded into the
appropriate counter. The instruction is classified using the
opcode. The function returns with the processed instruction
loaded into the output buffer. The function is provided with
two output buffers, one for each stream.
Function "decode_unit":

This function emulates the decode unit. It contains two
FIFO queues, represented as structures, and two decoding
routines; the decode unit has a queue and a decoder routine
for each stream. The data is read from the input buffer and
loaded into the appropriate queue.

Pointers are associated with each queue. The pointer
top_stack indicates the next free location. The pointer
bottom_stack points to the next instruction to be processed.
When the queue is empty, top_stack is equal to bottom_stack.
The instruction travels through the queue before it is
decoded. The decoded instruction is
stored in the output buffer. The information that is made
available in the decoding process is stored in a structure
which represents the instruction status unit. This
information is utilized by the issue unit to schedule the

execution of the instruction.


Function "issue_unit":

The function issue_unit emulates the operation of the
issue unit. The function receives the instruction from the
input buffer and detects hazards. The hazards are detected
by examining the counter values assigned to each register.
These values are available in the structure that represents
the instruction status unit. The hazards are resolved using
the equations derived in Chapter 2, and the instruction is
issued to the execution unit. The function returns with the
issued instruction placed in the output buffer.
5.2 FUNCTIONS EMULATING THE STAGES OF THE PEU:

The execution unit is the pipelined arithmetic unit.
The logic unit and the floating point unit are treated as
black boxes. The functions emulating the stages of the
execution unit are named after their respective stages.
For example, the function emulating stage one of the PEU
is called stage_one.
Function "stage_one":

The function stage_one emulates the operation of a
shifted multiplicand generator. This function generates the
initial partial products, which are summed to derive the
final product. The shifted multiplicand generator is an AND
gate array. The generated vectors follow these equations:

W(j+1) = a0*bj  a1*bj  a2*bj  a3*bj  a4*bj  a5*bj  a6*bj  a7*bj      ( j = 0, 1, ..., 7 )

so that W1 is generated for j = 0, W2 for j = 1, and so on up to
W8 for j = 7. The elements ai and bj belong to the input binary
vectors A and B.
Functions "stage_two" to "stage_five":

Stages two to five consist of the CSA elements. These
functions simulate the operation of the CSA elements. The
carry save adder is represented by the equations of the sum
and the carry vectors. These functions return with the
result in the output buffers, which are represented as
structures.
Function "stage_six":

The carry look ahead adder is also reproduced with the
aid of structures and fields. The partial sum vector and the
partial carry vector from the function stage_five are the
inputs to the present function. This function first generates
the carry elements using the inputs, and then the addition
takes place in the full adder using these carry elements.
The function can also receive its inputs externally; these
inputs are not related to the output of stage five. The add


instruction is introduced to the adder through the above


inputs.
The functions stage_one to stage_six are combined into
a single function called "pipeline".
5.3 CONTROL OF THE PIPELINE:

The pipeline activity has to be controlled to avoid
collisions of the data fed into the pipeline. The control of
the pipeline can be broadly classified into the following
control functions: 1) load_pipeline, 2) output_check,
3) set_logg, and 4) shift_trac. The function load_pipeline
mainly deals with the initiation of instructions into the
pipeline. The function output_check loads the results of
instructions into their destinations. The function
shift_trac is used to monitor the flow of instructions in
the arithmetic unit. The function set_logg monitors the
activity of the function shift_trac.
The following structures are used by the functions to
control the operation of the pipeline: 1) struct reg_stages,
2) struct iter_storage, 3) struct add_trac, 4) struct
mult_trac, 5) struct div_trac, 6) struct logg_sheet, and
7) struct multipurpose_registers.
The names of these structures state their respective
operations. The structures reg_stages are the output
buffers. The structures with names ending in '_trac' are
used to track the instructions through the pipeline, and
each operation has its own tracking registers. The structure
logg_sheet is used


to monitor the '_trac' structures. Each operation has its
own logg_sheet. One of the structures of the
multipurpose_registers is used as the control status
register. This register is used to pass control information
between the control functions.
Function "load_pipeline":

The input control deals with loading the arithmetic
unit with the instructions from the structure representing
the delay station (struct delay_station). The function scans
the structure delay_station for instructions that need to be
initiated into the pipeline during each iteration. The
instructions that need to be executed are checked against
the available latency. The latency information is available
from the state matrix. If the number of instructions to be
initiated is one and the latency is available, the
instruction is initiated into the pipeline unit. The token
for the destination register is loaded into the tracking
register, which tracks this instruction through the stages.
When more than one instruction contends for the same stage
and latency, the instruction with the higher priority is
initiated into the pipeline. The instructions with the
highest priority are those which are being iterated. If no
such instruction is present, then priority is given to the
instruction that has been in the structure delay_station for
the longest time. The instructions that have not been
initiated are assigned additional delays. The


counters in the structure instruction_status are updated
with this delay.
Function "output_check":

The output control can be divided into the following
operations: 1) non-divisional output control and 2)
division output control.
Non-divisional output control:
The non-division operations are addition, subtraction
and multiplication. The output control mainly deals with the
removal of the data from stage seven for the above mentioned
instructions. The information that the instruction has
reached stage seven is given by a tracking register assigned
for that particular instruction. For the multiplication
operation, the result is obtained when the tracking register
indicates stage seven. For addition and subtraction
operations, the result is obtained in the next cycle. The
result is loaded into the register specified by the tracking
register. The tracking register is initialized by the
function load-pipeline.
Division output control:
This deals with the operation of division only. When
the tracking register pertaining to this instruction
indicates stage 7, the outputs of stage six and stage
seven are taken from the pipeline and stored in the
structure priority-stack. The tracking register associated
with this instruction is made inactive. The tracking fields

are reset and the iteration counter is incremented. The


semi-processed instruction is given the highest priority and
will be initiated with the first available latency. The
number of iterations for the division instruction is fixed
at three. If the iteration counter is equal to 3 at the time
it is incremented, then the result is achieved and is
transferred to the appropriate register.
Function "shift_trac":

The function shift_trac is used to track the
instructions in the arithmetic unit. Each instruction that
is initiated is assigned a tracking register. The tracking
registers contain seven fields and each field represents a
stage. The tracking register will also contain the
destination register for the result of the instruction. When
an instruction is initiated at stage one, the tracking
register assigned to the instruction is initialized by
placing a token in field one. This function advances the
token to the next field denoting that the instruction has
moved to the next stage. When the token indicates that the
instruction is at the output stage, the function
output-check loads the result into the register specified
by the tracking register.
Function "set_logg":

This function keeps track of all the tracking registers
that are in use and updates the information about free
tracking registers.


Function "time_off":

The function time_off transfers the data from the
output buffer of one stage into the input buffer of the next
stage. This function maintains the data flow from one stage
to the other. The source code for emulating the PEU and PIU
is listed in appendix C.
5.4 COMPUTER GENERATION OF THE STATE DIAGRAMS:

The generation of the state diagram was implemented on
the VAX 11/750. The exact number of states cannot be
formulated as a general polynomial equation. However, we can
calculate the maximum number of states that are possible by
choosing the number of rows and columns. Starting from the
three initial collision matrices (three initial states), the
various state matrices are generated. The program consists
of various functions that are used to generate the state
matrices.

The functions are aided by two integer pointers

that monitor the generation of the state matrices. The state


matrices are held in structures. The integer pointers are
briefly described below. The functions are briefly described
below.
Pointer "index":

This is an integer pointer which is continuously updated
as new states are generated. The function of this pointer is
to determine whether all the possible states have been
derived.


Pointer "pres_num":

This is also an integer pointer, starting from state
one. It is incremented after all the possible states
formulated from the current state have been derived. When
its value is equal to the pointer "index", the program is
terminated.
Function "sleft_bits":

This function shifts the present matrix under
consideration left by the required number of positions,
given by the latency. Zeros are introduced from the right.
The resulting matrix is represented by a structure.
Function "or_cross":

In this function, the state matrix, which is stored as
a structure, is ORed with the required initial collision
matrix. If the initiation is a double initiation, then the
combined initial matrix is derived first and then ORed with
the current state matrix.
Function "name_it*":

This function is used to determine whether the new
state matrix is unique, that is, that no copy of it exists
among the state matrices that have been generated earlier.
This function generates the link list for the state matrix
under consideration in check_struc. The "*" in the name
indicates that this function is repeated for each compatible
initiation set.


Function "check_struc":

Function check_struc generates the new state matrices
from the current state. After each new state is created, it
is checked against the states that have been created
earlier. If a copy of this state is not present, then the
pointer index is incremented and the state matrix under
consideration is assigned a new state. Each new state matrix
is stored in a structure. The structure is provided with
fields which correspond to all possible initiations. These
fields are used to store the address of the next state for
that particular initiation. The pointer pres_num indicates
the current state to be investigated. From each state all
possible new states are derived. The function returns with
all the new states. The address fields are used as link
lists and contain the addresses of the states. The source
code for generating the state diagram is listed in Appendix
B. The state diagrams are presented in Appendix A.
5.5 EXPERIMENTAL RESULTS:

The simulation of the system was carried out on three
different instruction sets. The first instruction set
contained only the RAW hazard. The second instruction set
represented the RAW and WAW hazards. The third set
incorporated all three hazards. The instructions that the
program is capable of recognizing are shown in Fig. 5.4.
The format for each instruction is shown in Fig. 5.5. The
data is initially loaded as integers. It is converted to the
OPCODE   MNEMONIC   OPERATION
0        NOP        NO OPERATION
1        ADD        ADDITION
2        SUB        SUBTRACTION
3        MULT       MULTIPLICATION
4        DIVIDE     DIVISION
5        STORE      STORE (REGISTER -> MEMORY)
6        LOAD       LOAD (REGISTER <- MEMORY)
7        LOADI      LOAD (MEMORY <- DATA)
8        INC        INCREMENT
9        DEC        DECREMENT
10       AND        AND
11       OR         OR
12       NOT        NOT
13       BRANCH     UNCONDITIONAL BRANCH
14       BRANCHNZ   BRANCH IF NOT ZERO
15       BRANCHNC   BRANCH IF NO CARRY

Fig. 5.4 The instruction set adopted for simulation.

[Fig. 5.5 Instruction format adopted in the simulation program: the arithmetic instructions (ADD, SUB, MULT, DIVIDE) carry two source registers and a destination register; the data transfer instructions (LOAD, STORE) carry a memory location and a register; AND and OR carry two source registers and a destination register; INC, DEC and NOT carry a destination register; BRANCH, BRANCHNZ and BRANCHNC carry a destination address]

binary form at the entrance to the execution unit. This is


done to simplify the program.
The first instruction set is listed in Fig. 5.6. The
result was obtained after 19 iterations. Each iteration
represents a single pipeline cycle. The results are
tabulated as shown in Fig. 5.6, and the flow diagram is
shown in Fig. 5.7. The results of the second and third
instruction sets are illustrated in the same way: the
results of the second instruction set are shown in Figs. 5.8
and 5.9, and those of the third instruction set in Figs.
5.10 and 5.11. The timings coincide with the design values.
The program is capable of handling ten instructions at
a time. It is now being modified to run for larger sets
involving branch instructions. A sample set of instructions
listed in Smith [14] is being used to run the simulation.
The instruction set is the micro-code for a loop in Fortran.
The macro code is listed below:
      DO 10 I = 1,100
10    A(I) = B(I) + C(I)*D(I)

The micro code for the loop section of the macro code is
given below:

100: load r8, (C);
     load r9, (D);
     load r10, (B);
     load r11, (A);
     add r3, r8, r2;
     add r4, r9, r2;
     add r5, r10, r2;
     add r12, r11, r2;
     mult r6, r3, r4;
     add r7, r6, r5;
     store (r12), r7;
     dec r2;
     branchnz 100, r2;
The space time diagram for the static scheduling and
execution of the sample set for two loops is shown in Fig.
5.12. Fig. 5.13 illustrates the space time flow in the
proposed system. The flow in Figs. 5.12 and 5.13 represents
a hand simulation based on the proposed system. Speed up is
achieved due to the dynamic scheduling and execution.

Instruction set # 1:
    load r1, (X);
    load r2, (Y);
    add r3, r2, r1;
    store (Z), r3;

Location: (X) = 20, (Y) = 30

[Fig. 5.6 Program results for instruction set # 1: r2 holds 30 at iteration 10, r3 holds 50 at iteration 13, and (Z) holds 50 at iteration 19]

[Fig. 5.7 Space time flow of instruction set # 1: each instruction traced through the fetch (F), decode (D), issue (I) and execute (E) stages across the pipeline cycles]

Instruction set # 2:
    load r1, (X);
    load r2, (Y);
    add r3, r2, r1;
    store (Z), r3;
    load r3, (A);

Location: (X) = 20, (Y) = 30, (A) = 40

[Fig. 5.8 Program results for instruction set # 2: r2 holds 30 at iteration 10, r3 holds 50 at iteration 13 and 40 at iteration 18, and (Z) holds 50 at iteration 19]

[Fig. 5.9 Space time flow of the instruction set # 2]

Instruction set # 3:
    load r1, (X);
    load r2, (Y);
    add r3, r2, r1;
    add r4, r2, r1;
    store (Z), r3;
    load r3, (A);

Location: (X) = 20, (Y) = 30, (A) = 40

[Fig. 5.10 Program results for instruction set # 3]

[Fig. 5.11 Space time flow of the instruction set # 3]

[Fig. 5.12 Space time flow in case of static scheduling: the loop micro-code (four loads, four adds, mult, add, store, dec, branchnz) traced cycle by cycle for two loop iterations]

[Fig. 5.13 Space time flow in case of dynamic scheduling: the same loop micro-code traced cycle by cycle for two loop iterations]

CHAPTER SIX
CONCLUSIONS AND DISCUSSION

In this research, we have presented an algorithm for
dynamic instruction scheduling in a pipelined system.
Initially, the instructions fetched by the fetch unit are
classified. This classification is carried out to ascertain
the type of the instruction. If the instruction is found to
be a jump instruction, the second stream is made
operational. The streams are used to reduce the branch
overheads. The fetch unit keeps track of all the branch
instructions that have passed through it. This ensures that
the prefetching of instructions commences from the correct
location.
The instruction dependencies are resolved by using the
pointers associated with the sink registers. The scheduling
of the execution of the instructions is guaranteed hazard
free by the equations derived to resolve the hazards. The
buffers also aid the system by capturing the operands as
they become available. The missing operands are tagged with
the counter value. This eliminates the associative tag
comparisons proposed by Tomasulo [5] and Sohi and Vajapeyam
[6].

The execution unit also operates free of any hazard.
The state matrix ensures that the initiation is hazard free.
It also specifies the compatible initiations. The structure
of the arithmetic unit allows the execution of two
instructions simultaneously and hence increases the
throughput. The execution unit is also capable of
rescheduling a scheduled instruction. This makes the
system flexible. This flexibility is needed to ensure hazard
free operation of the system.
The dynamic execution of instructions in a pipelined
environment is hardly used in any of the high performance
computers today. The control of such systems is complicated,
which carries the potential for longer control paths and
longer clock periods. The idea is making a comeback in new
generation RISC processors, and the advancements in VLSI
technology are making it possible to realize such systems.
Interrupt handling and indirect addressing modes have
not been taken into consideration. Furthermore, the design
of the floating point unit and the logic unit has not been
discussed. These areas remain as our further research
effort.

REFERENCES

[1] Chen, T. C., "Unconventional super speed computer systems," in AFIPS 1971 Spring Jt. Computer Conf., AFIPS Press, Montvale, N.J., 1971, pp. 365-371.

[2] McIntyre, D., "An introduction to the ILLIAC IV computer," Datamation, April 1970, pp. 60-67.

[3] Evensen, A. J. and Troy, J. L., "Introduction to the architecture of a 288-element PEPE," in Proc. 1973 Sagamore Conf. on Parallel Processing, Springer-Verlag, N.Y., 1973, pp. 162-169.

[4] Rudolph, J. A., "A production implementation of an associative array processor - STARAN," in AFIPS 1972 Fall Jt. Computer Conf., AFIPS Press, Montvale, N.J., 1972, pp. 229-241.

[5] Tomasulo, R. M., "An efficient algorithm for exploiting multiple arithmetic units," IBM Journal of Research and Development, January 1967, pp. 25-33.

[6] Sohi, G. S. and Vajapeyam, S., "Instruction issue logic for high performance interruptable pipelined processors," ACM, June 1987, pp. 27-34.

[7] Keller, R. M., "Look ahead processors," Computing Surveys, Vol. 7, No. 4, December 1975, pp. 177-195.

[8] Dennis, J. B., "Modular, asynchronous control structures for a high performance processor," ACM Conf. Record, Project MAC Conf. on Concurrent Systems and Parallel Computation, June 1970, pp. 55-80.

[9] Tjaden, G. S. and Flynn, M. J., "Detection and parallel execution of independent instructions," IEEE Trans. Computers, Vol. C-19, No. 10, October 1970, pp. 889-895.

[10] Ramamoorthy, C. V. and Kim, K. H., "Pipelining - the generalized concept and sequencing strategies," Proc. NCC, 1974, pp. 289-297.

[11] Smith, J. E. and Weiss, S., "Instruction issue logic for super pipelined computers," IEEE Trans. Computers, Sept. 1984, pp. 110-118.

[12] Thornton, J. E., Design of a Computer - The Control Data 6600, Scott, Foresman and Co., Glenview, IL, 1970.

[13] Deverell, J., "Pipeline iterative arrays," IEEE Trans. Computers, Vol. C-23, No. 3, March 1975, pp. 317-322.

[14] Smith, J. E., "Dynamic instruction scheduling and the Astronautics ZS-1," Computer, July 1989, pp. 21-35.

[15] Ramamoorthy, C. V. and Li, H. F., "Pipelined architectures," Computing Surveys, Vol. 9, No. 1, March 1977, pp. 61-101.

[16] Ramamoorthy, C. V. and Li, H. F., "Efficiency in generalized pipeline networks," National Computer Conference, 1974, pp. 625-635.

[17] Sze, D. T. and Tou, J. T., "Efficient operation sequencing for pipeline machines," Proc. COMPCON, IEEE No. 72CH 0659-3C, 1972, pp. 265-268.

[18] Davidson, E. S., "Scheduling for pipelined processors," Proc. 7th Hawaii Conf. on System Sciences, 1974, pp. 58-60.

[19] Shar, L. E., "Design and scheduling of statically configured pipelines," Digital Systems Lab Report SU-SEL-72-042, Stanford University, Stanford, CA, September 1972.

[20] Patel, J. H. and Davidson, E. S., "Improving the throughput of a pipeline by insertion of delays," IEEE/ACM 3rd Ann. Symp. Computer Arch., IEEE No. 76CH 0143-5C, 1976, pp. 159-163.

[21] Thomas, A. T. and Davidson, E. S., "Scheduling of multiconfigurable pipelines," Proc. 12th Ann. Allerton Conf. Circuits and System Theory, Univ. of Illinois, Champaign-Urbana, 1974, pp. 658-669.

[22] Anderson, S. F., Earle, J. G., Goldschmidt, R. E., and Powers, D. M., "The IBM System/360 Model 91: Floating Point Execution Unit," IBM J. Res. Dev., January 1967, pp. 34-53.

[23] Hwang, K. and Briggs, F. A., Computer Architecture and Parallel Processing, McGraw-Hill Book Company, 1984.

APPENDIX

A.  Cross Collision Matrices

B.  Computer Program for Generating State Diagrams

C.  Simulation Program

APPENDIX A.

Cross Collision Matrices

APPENDIX B.

Computer Program for Generating State Diagrams

/*  generation of cross collision matrices  */

#include <stdio.h>
#include <math.h>
#define true 1
#define false 0

struct matrix
{
    int bits_row1[8];
    int bits_row2[8];
    int bits_row3[8];
};

struct direction
{
    int div_latency[8];
    int mult_latency[8];
    int add_latency[8];
    int div_add[8];
    int mult_add[8];
};

struct ident
{
    int name;
};

struct collision_matrix
{
    struct matrix smatrix;
    struct direction sdirection;
    struct ident sident;
};

struct collision_matrix binary_matrix[150];
struct collision_matrix for_present, upto_next, last_temp;
int index, pres_num;

/********************************************/
/*                                          */
/*        FUNCTION FOR INITIALIZING         */
/*                                          */
/********************************************/

struct collision_matrix init_cross(now)
struct collision_matrix now;
{
    static struct collision_matrix new = { { { 0 } } };

    now = new;
    return (now);
}

/********************************************/
/*                                          */
/*     Function for oring of matrices       */
/*                                          */
/********************************************/

struct collision_matrix or_cross(matrix_o, matrix_two)
struct collision_matrix matrix_o;
struct collision_matrix matrix_two;
{
    struct collision_matrix matrix_one;
    int j;

    matrix_one = init_cross(matrix_one);
    upto_next = init_cross(upto_next);
    for (j = 0; j < 8; ++j)
    {
        matrix_one.smatrix.bits_row1[j] = (matrix_o.smatrix.bits_row1[j] |
                                           matrix_two.smatrix.bits_row1[j]);
        matrix_one.smatrix.bits_row2[j] = (matrix_o.smatrix.bits_row2[j] |
                                           matrix_two.smatrix.bits_row2[j]);
        matrix_one.smatrix.bits_row3[j] = (matrix_o.smatrix.bits_row3[j] |
                                           matrix_two.smatrix.bits_row3[j]);
    }
    matrix_o = matrix_one;
    upto_next = matrix_one;   /* callers pick the result up here */
    return (matrix_one);
}

/********************************************/
/*                                          */
/*  FUNCTION FOR SHIFTING THE COLLISION     */
/*  BITS                                    */
/*                                          */
/********************************************/

struct collision_matrix sleft_bits(present, number)
struct collision_matrix present;
int number;
{
    struct collision_matrix use_once;
    int left;

    use_once = init_cross(use_once);
    for (left = 0; left < (8 - number); ++left)
    {
        use_once.smatrix.bits_row1[left] =
            present.smatrix.bits_row1[left + number];
        use_once.smatrix.bits_row2[left] =
            present.smatrix.bits_row2[left + number];
        use_once.smatrix.bits_row3[left] =
            present.smatrix.bits_row3[left + number];
    }
    present = use_once;
    return (present);
}

/********************************************/
/*                                          */
/*           FUNCTION NAME IT               */
/*     (records the division latency)       */
/*                                          */
/********************************************/

void dname_it(present, coming, ineex, pre_nu, repeat)
struct collision_matrix present, coming[];
int ineex, repeat, pre_nu;
{
    int i, j, find_sucess, number, consider;
    unsigned int flag;

    flag = 0;
    find_sucess = 0;
    number = index;
    consider = pres_num;
    for (i = 1; i <= number; ++i)
    {
        for (j = 0; j < 8; ++j)
        {
            if (((present.smatrix.bits_row1[j] ==
                  coming[i].smatrix.bits_row1[j]) &&
                 (present.smatrix.bits_row2[j] ==
                  coming[i].smatrix.bits_row2[j])) &&
                (present.smatrix.bits_row3[j] ==
                 coming[i].smatrix.bits_row3[j]))
            {
                flag = flag << 1;
                flag = flag | 1;
            }
            else
            {
                flag = flag << 1;
                flag = flag | 0;
            }
        }
        if (flag == 255)
        {
            binary_matrix[consider].sdirection.div_latency[repeat] = i;
            find_sucess = 1;
            break;
        }
        else
        {
            flag = 0;
        }
    }
    if (find_sucess != 1)
    {
        binary_matrix[number + 1] = present;
        binary_matrix[number + 1].sident.name = number + 1;
        binary_matrix[consider].sdirection.div_latency[repeat] =
            number + 1;
        index = number + 1;
    }
    return;
}

/********************************************/
/*                                          */
/*           FUNCTION NAME IT               */
/*   (records the multiplication latency)   */
/*                                          */
/********************************************/

void mname_it(present, coming, ineex, pre_nu, repeat)
struct collision_matrix present, coming[];
int ineex, repeat, pre_nu;
{
    int i, j, find_sucess, number, consider;
    unsigned int flag;

    flag = 0;
    find_sucess = 0;
    number = index;
    consider = pres_num;
    for (i = 1; i <= number; ++i)
    {
        for (j = 0; j < 8; ++j)
        {
            if (((present.smatrix.bits_row1[j] ==
                  coming[i].smatrix.bits_row1[j]) &&
                 (present.smatrix.bits_row2[j] ==
                  coming[i].smatrix.bits_row2[j])) &&
                (present.smatrix.bits_row3[j] ==
                 coming[i].smatrix.bits_row3[j]))
            {
                flag = flag << 1;
                flag = flag | 1;
            }
            else
            {
                flag = flag << 1;
                flag = flag | 0;
            }
        }
        if (flag == 255)
        {
            binary_matrix[consider].sdirection.mult_latency[repeat] = i;
            find_sucess = 1;
            break;
        }
        else
        {
            flag = 0;
        }
    }
    if (find_sucess != 1)
    {
        binary_matrix[number + 1] = present;
        binary_matrix[number + 1].sident.name = number + 1;
        binary_matrix[consider].sdirection.mult_latency[repeat] =
            number + 1;
        index = number + 1;
    }
    return;
}

/********************************************/
/*                                          */
/*           FUNCTION NAME IT               */
/*     (records the addition latency)       */
/*                                          */
/********************************************/

void aname_it(present, coming, ineex, pre_nu, repeat)
struct collision_matrix present, coming[];
int ineex, repeat, pre_nu;
{
    int i, j, find_sucess, number, consider;
    unsigned int flag;

    flag = 0;
    find_sucess = 0;
    number = index;
    consider = pres_num;
    for (i = 1; i <= number; ++i)
    {
        for (j = 0; j < 8; ++j)
        {
            if (((present.smatrix.bits_row1[j] ==
                  coming[i].smatrix.bits_row1[j]) &&
                 (present.smatrix.bits_row2[j] ==
                  coming[i].smatrix.bits_row2[j])) &&
                (present.smatrix.bits_row3[j] ==
                 coming[i].smatrix.bits_row3[j]))
            {
                flag = flag << 1;
                flag = flag | 1;
            }
            else
            {
                flag = flag << 1;
                flag = flag | 0;
            }
        }
        if (flag == 255)
        {
            binary_matrix[consider].sdirection.add_latency[repeat] = i;
            find_sucess = 1;
            break;
        }
        else
        {
            flag = 0;
        }
    }
    if (find_sucess != 1)
    {
        binary_matrix[number + 1] = present;
        binary_matrix[number + 1].sident.name = number + 1;
        binary_matrix[consider].sdirection.add_latency[repeat] =
            number + 1;
        index = number + 1;
    }
    return;
}

/********************************************/
/*                                          */
/*           FUNCTION NAME IT               */
/*  (records the divide-add cross latency)  */
/*                                          */
/********************************************/

void daname_it(present, coming, ineex, pre_nu, repeat)
struct collision_matrix present, coming[];
int ineex, repeat, pre_nu;
{
    int i, j, find_sucess, number, consider;
    unsigned int flag;

    flag = 0;
    find_sucess = 0;
    number = index;
    consider = pres_num;
    for (i = 1; i <= number; ++i)
    {
        for (j = 0; j < 8; ++j)
        {
            if (((present.smatrix.bits_row1[j] ==
                  coming[i].smatrix.bits_row1[j]) &&
                 (present.smatrix.bits_row2[j] ==
                  coming[i].smatrix.bits_row2[j])) &&
                (present.smatrix.bits_row3[j] ==
                 coming[i].smatrix.bits_row3[j]))
            {
                flag = flag << 1;
                flag = flag | 1;
            }
            else
            {
                flag = flag << 1;
                flag = flag | 0;
            }
        }
        if (flag == 255)
        {
            binary_matrix[consider].sdirection.div_add[repeat] = i;
            find_sucess = 1;
            break;
        }
        else
        {
            flag = 0;
        }
    }
    if (find_sucess != 1)
    {
        binary_matrix[number + 1] = present;
        binary_matrix[number + 1].sident.name = number + 1;
        binary_matrix[consider].sdirection.div_add[repeat] =
            number + 1;
        index = number + 1;
    }
    return;
}

/********************************************/
/*                                          */
/*           FUNCTION NAME IT               */
/* (records the multiply-add cross latency) */
/*                                          */
/********************************************/

void maname_it(present, coming, ineex, pre_nu, repeat)
struct collision_matrix present, coming[];
int ineex, repeat, pre_nu;
{
    int i, j, find_sucess, number, consider;
    unsigned int flag;

    flag = 0;
    find_sucess = 0;
    number = index;
    consider = pres_num;
    for (i = 1; i <= number; ++i)
    {
        for (j = 0; j < 8; ++j)
        {
            if (((present.smatrix.bits_row1[j] ==
                  coming[i].smatrix.bits_row1[j]) &&
                 (present.smatrix.bits_row2[j] ==
                  coming[i].smatrix.bits_row2[j])) &&
                (present.smatrix.bits_row3[j] ==
                 coming[i].smatrix.bits_row3[j]))
            {
                flag = flag << 1;
                flag = flag | 1;
            }
            else
            {
                flag = flag << 1;
                flag = flag | 0;
            }
        }
        if (flag == 255)
        {
            binary_matrix[consider].sdirection.mult_add[repeat] = i;
            find_sucess = 1;
            break;
        }
        else
        {
            flag = 0;
        }
    }
    if (find_sucess != 1)
    {
        binary_matrix[number + 1] = present;
        binary_matrix[number + 1].sident.name = number + 1;
        binary_matrix[consider].sdirection.mult_add[repeat] =
            number + 1;
        index = number + 1;
    }
    return;
}

/********************************************/
/*                                          */
/*   FUNCTION TO GENERATE AND CHECK THE     */
/*   STRUCTURES FOR NON REPETITION          */
/*                                          */
/********************************************/

void check_struc(put, inex, pr_nu)
struct collision_matrix put[];
int inex, pr_nu;
{
    struct collision_matrix temp_struct, sec_struc;
    int j, consider;

    consider = pres_num;
    temp_struct = init_cross(temp_struct);
    temp_struct = put[consider];
    sec_struc = temp_struct;

    /* generation of new structures */
    for (j = 0; j < 8; ++j)
    {
        if (temp_struct.smatrix.bits_row1[j] == 0)
        {
            sec_struc = sleft_bits(sec_struc, j);
            sec_struc = or_cross(sec_struc, put[1]);
            sec_struc = upto_next;
            dname_it(sec_struc, put, inex, consider, j);
            sec_struc = temp_struct;
        }
        if (temp_struct.smatrix.bits_row2[j] == 0)
        {
            sec_struc = sleft_bits(sec_struc, j);
            sec_struc = or_cross(sec_struc, put[2]);
            sec_struc = upto_next;
            mname_it(sec_struc, put, inex, consider, j);
            sec_struc = temp_struct;
        }
        if (temp_struct.smatrix.bits_row3[j] == 0)
        {
            sec_struc = sleft_bits(sec_struc, j);
            sec_struc = or_cross(sec_struc, put[3]);
            sec_struc = upto_next;
            aname_it(sec_struc, put, inex, consider, j);
            sec_struc = temp_struct;
        }
        if ((temp_struct.smatrix.bits_row1[j] == 0) &&
            (temp_struct.smatrix.bits_row3[j] == 0))
        {
            sec_struc = sleft_bits(sec_struc, j);
            sec_struc = or_cross(sec_struc, put[1]);
            sec_struc = upto_next;
            sec_struc = or_cross(sec_struc, put[3]);
            sec_struc = upto_next;
            daname_it(sec_struc, put, inex, consider, j);
            sec_struc = temp_struct;
        }
    }
    for (j = 0; j < 8; ++j)
    {
        if ((temp_struct.smatrix.bits_row2[j] == 0) &&
            (temp_struct.smatrix.bits_row3[j] == 0))
        {
            sec_struc = sleft_bits(sec_struc, j);
            sec_struc = or_cross(sec_struc, put[2]);
            sec_struc = upto_next;
            sec_struc = or_cross(sec_struc, put[3]);
            sec_struc = upto_next;
            maname_it(sec_struc, put, inex, consider, j);
            sec_struc = temp_struct;
        }
    }
    return;
}
1
main()
{
    int l, v;

    pres_num = 1;
    index = 3;
    binary_matrix[1] = init_cross(binary_matrix[1]);
    binary_matrix[2] = init_cross(binary_matrix[2]);
    binary_matrix[3] = init_cross(binary_matrix[3]);
    binary_matrix[1].smatrix.bits_row1[0] = 1;
    binary_matrix[1].smatrix.bits_row1[1] = 1;
    binary_matrix[1].smatrix.bits_row1[2] = 1;
    binary_matrix[1].smatrix.bits_row2[0] = 1;
    binary_matrix[1].smatrix.bits_row2[1] = 1;
    binary_matrix[1].smatrix.bits_row2[2] = 1;
    binary_matrix[1].smatrix.bits_row3[5] = 1;
    binary_matrix[1].smatrix.bits_row3[6] = 1;
    binary_matrix[1].smatrix.bits_row3[7] = 1;
    binary_matrix[2].smatrix.bits_row1[0] = 1;
    binary_matrix[2].smatrix.bits_row2[0] = 1;
    binary_matrix[2].smatrix.bits_row3[5] = 1;
    binary_matrix[3].smatrix.bits_row3[0] = 1;
    binary_matrix[1].sident.name = 1;
    binary_matrix[2].sident.name = 2;
    binary_matrix[3].sident.name = 3;

    while (pres_num <= index)
    {
        check_struc(binary_matrix, index, pres_num);
        pres_num = pres_num + 1;
    }

    printf("the various structures are tabulated below\n\n");
    for (v = 1; v <= index; v++)
    {
        for (l = 0; l < 8; ++l)
            printf("%d ", binary_matrix[v].smatrix.bits_row1[l]);
        printf("\n\n");
        for (l = 0; l < 8; ++l)
            printf("%d ", binary_matrix[v].smatrix.bits_row2[l]);
        printf("\n\n");
        for (l = 0; l < 8; ++l)
            printf("%d ", binary_matrix[v].smatrix.bits_row3[l]);
        printf("\n\n");
        for (l = 0; l < 8; ++l)
            printf("%d ", binary_matrix[v].sdirection.div_latency[l]);
        printf("\n\n");
        for (l = 0; l < 8; ++l)
            printf("%d ", binary_matrix[v].sdirection.mult_latency[l]);
        printf("\n\n");
        for (l = 0; l < 8; ++l)
            printf("%d ", binary_matrix[v].sdirection.add_latency[l]);
        printf("\n\n");
        for (l = 0; l < 8; ++l)
            printf("%d ", binary_matrix[v].sdirection.div_add[l]);
        printf("\n\n");
        for (l = 0; l < 8; ++l)
            printf("%d ", binary_matrix[v].sdirection.mult_add[l]);
        printf("\n\n");
        printf("%d\n\n", binary_matrix[v].sident.name);
    }
}

APPENDIX C.

Simulation Program

/************************************************/
/*****   SIMULATION OF DYNAMIC ARITHMETIC   *****/
/*****             PIPELINE                 *****/
/*****            VERSION 1.0               *****/
/************************************************/
/* In this program the eighth bit is stored in  */
/* position 0 and the first bit in position 8.  */
/************************************************/

#include <stdio.h>
#include <math.h>
#define true 1
#define false 0

/* the structure initializations for the instruction unit */

struct input_inst
{
    int opcode_field;
    int source_operand1;
    int source_operand2;
    int dest_operand;
    int valid;
};

struct instruction_status
{
    int inst_num;
    int pipe_cyl;
    int opcode;
    int exec_time;
    int reg_util[6];     /* indexed 1..5 by register number */
    int count_units[6];  /* indexed 1..5 by register number */
    int decode_ptr;
    int issue_ptr;
};

struct reg_file
{
    int reg_units[5];
};

struct status_reg
{
    int carry;
    int overflow;
    int sign;
    int zero;
};

struct address_counter
{
    int counter[20];
    int free_index;
};

struct issue_latch
{
    int opcode_fld;
    int dest_fld;
    int source1_fld;
    int src1data_fld;
    int src1delay_fld;
    int source2_fld;
    int src2data_fld;
    int src2delay_fld;
    int instdelay_fld;
};

struct dstack_status
{
    int queue_select;
    int full_queue;
    int flush_flag;
    int top_stack;
    int bottom_stack;
};

struct fetch_status
{
    int flush_flag;
    int address_flag;
    int picqueue_full;
    int eacqueue_full;
};

struct matrix
{
    int bits_row1[8];
    int bits_row2[8];
    int bits_row3[8];
};

struct direction
{
    int div_latency[8];
    int mult_latency[8];
    int add_latency[8];
    int div_add[8];
    int mult_add[8];
};

struct ident
{
    int name;
};

struct collision_matrix
{
    struct matrix smatrix;
    struct direction sdirection;
    struct ident sident;
};

struct recode
{
    int bits[15];
};

struct reg_stages
{
    int word[17];
};

struct div_track
{
    char name_one[3];
    int number;
    int st_track[10];
    int itr_track;
    int address;
};

struct mult_track
{
    char name_two[4];
    int number;
    int st_track[10];
    int address;
};

struct add_track
{
    char name_three[3];
    int number;
    int st_track[10];
    int address;
};

struct logg_sheet
{
    int logg[10];
    int logg_stat;
};

struct input_process
{
    int location;
    int func;
    int num_one[10];
    int num_two[10];
    int over_flow;
    int weight;
};

struct itr_storage
{
    int address;
    int func;
    int num_one[10];
    int num_two[10];
};

struct output_process
{
    int destination;
    int overflow;
    int result[17];
    int wt_factor;
};
typedef struct collision_matrix struct0;
typedef struct recode struct1;
typedef struct reg_stages struct2;
typedef struct div_track struct3;
typedef struct mult_track struct4;
typedef struct add_track struct5;
typedef struct logg_sheet struct6;
typedef struct input_process struct7;
typedef struct output_process struct8;
typedef struct itr_storage struct9;

typedef struct input_inst structi0;
typedef struct instruction_status structi1;
typedef struct reg_file structi2;
typedef struct status_reg structi3;
typedef struct address_counter structi4;
typedef struct issue_latch structi5;
typedef struct fetch_status structi6;
typedef struct dstack_status structi7;

structi0 memory[100], decode_stack1[20], decode_stack2[20];
structi0 *memory_ptr, *dstack1_ptr, *dstack2_ptr;
structi0 iunit_latches[20], internal_holders[20],
         *ilatch_ptr, *inhold_ptr;
structi1 status_unit[100], *statusu_ptr;
structi2 gp_register, *gp_ptr;  /* general purpose registers */
structi3 register_sr, *sr_ptr;
structi4 pgm_counter1, pgm_counter2, *pgm_ptr1, *pgm_ptr2;
structi5 isunit_latch, *isunit_ptr;
structi6 stream_status, *sstatus_ptr;
structi7 picqueue_status, eacqueue_status,
         *picstatus_ptr, *eacstatus_ptr;

int queue_select, current_queue, disable_decode, disable_issue;
struct collision_matrix binary_matrix[150];
struct collision_matrix for_present, upto_next, last_temp;
struct1 argument1[20], argument2[20], multipurpose_reg[20],
        *mpreg_ptr;
struct2 latches[30], par_product[10], transfer[30], delay[20];
struct3 div_follow[10], delta_track[10], *divflow_ptr,
        *deltaflow_ptr;
struct3 *copy_seven, *copy_eight;
struct4 mult_follow[10], *multflow_ptr, *copy_nine;
struct5 add_follow[10], *addflow_ptr, *copy_ten,
        sub_follow[10], *subflow_ptr;
struct6 div_logg, mult_logg, add_logg, process_logg[10],
        *prlogg_ptr;
struct7 input_stack[41], *instack_ptr, *copy_eleven;
struct8 output_stack[41], *outstack_ptr, *copy_twelve;
struct9 priority_stack[70], *prstack_ptr;
struct0 *bin_pointer;
struct1 *arg1_pointer, *arg2_pointer, *copy_one, *copy_two;
struct2 *par_pointer, *lat_pointer, *copy_four,
        *copy_three, *trans_pointer, *copy_five;
struct2 *delay_ptr, *copy_six;
int op_code[20], arg_one[20][9], arg_two[20][9];
int *ptr_op, *ptr_argmnt1[20], *ptr_argmnt2[20];
int index, pres_num, stk_ptr, total, multiplication, division;
int var1, var2, var3, var4, init_key, addition, subtraction,
    delta_flag;
int global_one[20], global_two[20],
    global_three[20], readjust;

/********************************************/
/*                                          */
/*   Functions of the instruction unit      */
/*                                          */
/********************************************/

/********************************************/
/*                                          */
/*        Instruction Fetch Unit            */
/*                                          */
/********************************************/

structi0 fetch_unit(ptr1, ptr2, ptr3, ptr4, ptr5, ptr6)
structi0 *ptr1, *ptr2;  /* pointers to the memory and the latches */
structi4 *ptr3, *ptr4;  /* pointers to the address counters */
structi6 *ptr5, *ptr6;  /* flags which denote the current queue in
                           session */
{
    int i, transfer_flag1, transfer_flag2;
    int program_counter1, program_counter2;

    transfer_flag1 = 0;
    transfer_flag2 = 0;
    (ptr2 + 1)->valid = 0;
    (ptr2 + 2)->valid = 0;

    /* to check and flush the redundant queues */
    if ((ptr5)->flush_flag == 1)
    {
        printf("flush flag is enabled\n");
        if ((ptr5)->address_flag == 1)
        {
            /* flush the PIC stream */
            printf("flush PIC stream\n");
            (ptr3)->counter[0] = 0;
            for (i = 1; i <= 9; i++)
            {
                (ptr4)->counter[i] = 0;
            }
            (ptr4)->free_index = 1;  /* setting the index flag of
                counter2 to 1 to indicate that the counters are
                flushed and the counter to be filled first is
                counter[1] */
        }
        if ((ptr5)->address_flag == 2)
        {
            /* flush the EAC stream */
            (ptr4)->counter[0] = 0;
            for (i = 1; i <= 9; i++)
            {
                (ptr3)->counter[i] = 0;
            }
            (ptr3)->free_index = 1;  /* setting the index flag of
                counter1 to 1 to indicate that the counters are
                flushed and the counter to be filled first is
                counter[1] */
        }
    }

    /* reading the memory for instructions; instructions will be
       fetched only if the program counter of the individual
       stream is non zero */
    program_counter1 = (ptr3)->counter[0];
    program_counter2 = (ptr4)->counter[0];

    /* fetching of instructions for the PIC stream */
    if ((program_counter1 != 0) && ((ptr5)->picqueue_full == 0))
    {
        (ptr2 + 1)->opcode_field =
            (ptr1 + program_counter1)->opcode_field;
        (ptr2 + 1)->source_operand1 =
            (ptr1 + program_counter1)->source_operand1;
        (ptr2 + 1)->source_operand2 =
            (ptr1 + program_counter1)->source_operand2;
        (ptr2 + 1)->dest_operand =
            (ptr1 + program_counter1)->dest_operand;
        transfer_flag1 = 1;  /* valid instruction; pass it to the
                                decode unit */
        (ptr3)->counter[0] += 1;
        (ptr2 + 1)->valid = 1;
    }

    /* fetching of instructions for the EAC stream */
    if ((program_counter2 != 0) && ((ptr5)->eacqueue_full == 0))
    {
        (ptr2 + 2)->opcode_field =
            (ptr1 + program_counter2)->opcode_field;
        (ptr2 + 2)->source_operand1 =
            (ptr1 + program_counter2)->source_operand1;
        (ptr2 + 2)->source_operand2 =
            (ptr1 + program_counter2)->source_operand2;
        (ptr2 + 2)->dest_operand =
            (ptr1 + program_counter2)->dest_operand;
        transfer_flag2 = 1;  /* valid instruction; pass it to the
                                decode unit */
        (ptr4)->counter[0] += 1;
        (ptr2 + 2)->valid = 1;
    }

    /* classifying the instruction:
       checking for jump instructions in the PIC stream */
    if (((ptr2 + 1)->opcode_field >= 13) && (transfer_flag1 != 0))
    {
        printf("there is a branch instruction detected in the PIC stream\n");
        (ptr4)->counter[(ptr4)->free_index] = (ptr2 + 1)->dest_operand;
        (ptr4)->free_index += 1;
    }

    /* checking for jump instructions in the EAC stream */
    if (((ptr2 + 2)->opcode_field >= 13) && (transfer_flag2 != 0))
    {
        printf("there is a branch instruction detected in the EAC stream\n");
        (ptr3)->counter[(ptr3)->free_index] = (ptr2 + 2)->dest_operand;
        (ptr3)->free_index += 1;
    }

    printf("the instruction fetched from memory for PIC stream\n");
    printf("opcode of ptr2+1 is %d\n", (ptr2 + 1)->opcode_field);
    printf("source operand1 of ptr2+1 %d\n", (ptr2 + 1)->source_operand1);
    printf("source operand2 of ptr2+1 %d\n", (ptr2 + 1)->source_operand2);
    printf("dest operand of ptr2+1 %d\n", (ptr2 + 1)->dest_operand);

    printf("the instruction fetched from memory for EAC stream\n");
    printf("opcode of ptr2+2 is %d\n", (ptr2 + 2)->opcode_field);
    printf("source operand1 of ptr2+2 %d\n", (ptr2 + 2)->source_operand1);
    printf("source operand2 of ptr2+2 %d\n", (ptr2 + 2)->source_operand2);
    printf("dest operand of ptr2+2 %d\n", (ptr2 + 2)->dest_operand);

    printf("the program counters are listed below\n");
    for (i = 0; i <= 9; i++)
    {
        printf("the value of counter %d of PIC stream is %d\n",
               i, (ptr3)->counter[i]);
    }
    for (i = 0; i <= 9; i++)
    {
        printf("the value of counter %d of EAC stream is %d\n",
               i, (ptr4)->counter[i]);
    }
    return (*(ptr2 + 1));
}
*/
*/

/********************************************/
/*                                          */
/*   Function to load the instruction       */
/*   status unit for the PIC stream         */
/*                                          */
/********************************************/

void load_isunit1(ptr1, ptr2, ptr3, ptr4, ptr5, ptr6)
structi0 *ptr1, *ptr2;
structi1 *ptr3;
structi7 *ptr4, *ptr5;
int ptr6;
{
    int i, bottom_stack1;

    bottom_stack1 = ptr6;
    /* mark every register unused (3), then mark the source
       registers (1) and the destination register (0) */
    for (i = 1; i <= 5; i++)
    {
        (ptr3 + (ptr3)->decode_ptr)->reg_util[i] = 3;
    }
    switch ((ptr1 + bottom_stack1)->source_operand1)
    {
    case 1: (ptr3 + (ptr3)->decode_ptr)->reg_util[1] = 1; break;
    case 2: (ptr3 + (ptr3)->decode_ptr)->reg_util[2] = 1; break;
    case 3: (ptr3 + (ptr3)->decode_ptr)->reg_util[3] = 1; break;
    case 4: (ptr3 + (ptr3)->decode_ptr)->reg_util[4] = 1; break;
    case 5: (ptr3 + (ptr3)->decode_ptr)->reg_util[5] = 1; break;
    }
    switch ((ptr1 + bottom_stack1)->source_operand2)
    {
    case 1: (ptr3 + (ptr3)->decode_ptr)->reg_util[1] = 1; break;
    case 2: (ptr3 + (ptr3)->decode_ptr)->reg_util[2] = 1; break;
    case 3: (ptr3 + (ptr3)->decode_ptr)->reg_util[3] = 1; break;
    case 4: (ptr3 + (ptr3)->decode_ptr)->reg_util[4] = 1; break;
    case 5: (ptr3 + (ptr3)->decode_ptr)->reg_util[5] = 1; break;
    }
    switch ((ptr1 + bottom_stack1)->dest_operand)
    {
    case 1: (ptr3 + (ptr3)->decode_ptr)->reg_util[1] = 0; break;
    case 2: (ptr3 + (ptr3)->decode_ptr)->reg_util[2] = 0; break;
    case 3: (ptr3 + (ptr3)->decode_ptr)->reg_util[3] = 0; break;
    case 4: (ptr3 + (ptr3)->decode_ptr)->reg_util[4] = 0; break;
    case 5: (ptr3 + (ptr3)->decode_ptr)->reg_util[5] = 0; break;
    }
}

/********************************************/
/*                                          */
/*   Function to load the instruction       */
/*   status unit for the EAC stream         */
/*                                          */
/********************************************/

void load_isunit2(ptr1, ptr2, ptr3, ptr4, ptr5, ptr6)
structi0 *ptr1, *ptr2;
structi1 *ptr3;
structi7 *ptr4, *ptr5;
int ptr6;
{
    int i, bottom_stack2;

    bottom_stack2 = ptr6;
    for (i = 1; i <= 5; i++)
    {
        (ptr3 + (ptr3)->decode_ptr)->reg_util[i] = 3;
    }
    switch ((ptr2 + bottom_stack2)->source_operand1)
    {
    case 1: (ptr3 + (ptr3)->decode_ptr)->reg_util[1] = 1; break;
    case 2: (ptr3 + (ptr3)->decode_ptr)->reg_util[2] = 1; break;
    case 3: (ptr3 + (ptr3)->decode_ptr)->reg_util[3] = 1; break;
    case 4: (ptr3 + (ptr3)->decode_ptr)->reg_util[4] = 1; break;
    case 5: (ptr3 + (ptr3)->decode_ptr)->reg_util[5] = 1; break;
    }
    switch ((ptr2 + bottom_stack2)->source_operand2)
    {
    case 1: (ptr3 + (ptr3)->decode_ptr)->reg_util[1] = 1; break;
    case 2: (ptr3 + (ptr3)->decode_ptr)->reg_util[2] = 1; break;
    case 3: (ptr3 + (ptr3)->decode_ptr)->reg_util[3] = 1; break;
    case 4: (ptr3 + (ptr3)->decode_ptr)->reg_util[4] = 1; break;
    case 5: (ptr3 + (ptr3)->decode_ptr)->reg_util[5] = 1; break;
    }
    switch ((ptr2 + bottom_stack2)->dest_operand)
    {
    case 1: (ptr3 + (ptr3)->decode_ptr)->reg_util[1] = 0; break;
    case 2: (ptr3 + (ptr3)->decode_ptr)->reg_util[2] = 0; break;
    case 3: (ptr3 + (ptr3)->decode_ptr)->reg_util[3] = 0; break;
    case 4: (ptr3 + (ptr3)->decode_ptr)->reg_util[4] = 0; break;
    case 5: (ptr3 + (ptr3)->decode_ptr)->reg_util[5] = 0; break;
    }
}

/********************************************/
/*                                          */
/*               Decode unit                */
/*                                          */
/********************************************/

structi0 decode_unit(ptr1, ptr2, ptr3, ptr4, ptr5, ptr6, ptr7, ptr8, ptr9)
structi0 *ptr1, *ptr2, *ptr3;  /* input from latch, dstack1, dstack2 */
structi1 *ptr4;                /* system status pointer */
structi7 *ptr5, *ptr6;         /* pointers to dstack1 and dstack2 */
structi0 *ptr7;                /* general purpose elements */
int ptr8, ptr9;
{
    int i;
    int top_stack1, bottom_stack1, top_stack2, bottom_stack2;

    top_stack1 = (ptr5)->top_stack;
    bottom_stack1 = (ptr5)->bottom_stack;
    top_stack2 = (ptr6)->top_stack;
    bottom_stack2 = (ptr6)->bottom_stack;

    /* loading of the PIC queue */
    if (((ptr5)->queue_select == 1) && ((ptr5)->full_queue != 1))
    {
        /* check whether the instruction is a valid instruction
           for the PIC stream */
        if ((ptr1 + 1)->valid != 0)
        {
            /* the instruction is valid */
            (ptr2 + top_stack1)->opcode_field = (ptr1 + 1)->opcode_field;
            (ptr2 + top_stack1)->source_operand1 = (ptr1 + 1)->source_operand1;
            (ptr2 + top_stack1)->source_operand2 = (ptr1 + 1)->source_operand2;
            (ptr2 + top_stack1)->dest_operand = (ptr1 + 1)->dest_operand;
            (ptr5)->top_stack += 1;
        }
    }

    /* loading of the EAC queue */
    if (((ptr6)->queue_select == 1) && ((ptr6)->full_queue != 1))
    {
        /* check whether the instruction is a valid instruction
           for the EAC queue */
        if ((ptr1 + 2)->valid != 0)
        {
            /* the instruction is valid */
            (ptr3 + top_stack2)->opcode_field = (ptr1 + 2)->opcode_field;
            (ptr3 + top_stack2)->source_operand1 = (ptr1 + 2)->source_operand1;
            (ptr3 + top_stack2)->source_operand2 = (ptr1 + 2)->source_operand2;
            (ptr3 + top_stack2)->dest_operand = (ptr1 + 2)->dest_operand;
            (ptr6)->top_stack += 1;
        }
    }

    /* forwarding the instruction to the decoder; the fifteen-way
       switches of the original listing set the opcode and an
       execution time of 3 cycles for every opcode except 3
       (8 cycles), 4 (23 cycles) and 5, 6 (6 cycles each), so the
       execution times are tabulated once here */
    {
        static int exec_table[16] =
            { 0, 3, 3, 8, 23, 6, 6, 3, 3, 3, 3, 3, 3, 3, 3, 3 };
        int opc;

        if (current_queue == 1)
        {
            opc = (ptr2 + bottom_stack1)->opcode_field;
            if ((opc >= 1) && (opc <= 15))
            {
                (ptr4 + (ptr4)->decode_ptr)->opcode = opc;
                (ptr4 + (ptr4)->decode_ptr)->exec_time = exec_table[opc];
                load_isunit1(ptr2, ptr3, ptr4, ptr5, ptr6, bottom_stack1);
            }

            /* forwarding the instruction to the issue unit */
            if (disable_decode != 1)
            {
                (ptr7 + 3)->opcode_field =
                    (ptr2 + bottom_stack1)->opcode_field;
                (ptr7 + 3)->source_operand1 =
                    (ptr2 + bottom_stack1)->source_operand1;
                (ptr7 + 3)->source_operand2 =
                    (ptr2 + bottom_stack1)->source_operand2;
                (ptr7 + 3)->dest_operand =
                    (ptr2 + bottom_stack1)->dest_operand;

                /* rearranging the stack; the decode stacks hold
                   20 entries (indices 0..19) */
                for (i = 2; i <= 19; i++)
                {
                    (ptr2 + (i - 1))->opcode_field =
                        (ptr2 + i)->opcode_field;
                    (ptr2 + (i - 1))->source_operand1 =
                        (ptr2 + i)->source_operand1;
                    (ptr2 + (i - 1))->source_operand2 =
                        (ptr2 + i)->source_operand2;
                    (ptr2 + (i - 1))->dest_operand =
                        (ptr2 + i)->dest_operand;
                }
                (ptr5)->top_stack -= 1;
            }
        }

        if (current_queue == 2)
        {
            opc = (ptr3 + bottom_stack2)->opcode_field;
            if ((opc >= 1) && (opc <= 15))
            {
                (ptr4 + (ptr4)->decode_ptr)->opcode = opc;
                (ptr4 + (ptr4)->decode_ptr)->exec_time = exec_table[opc];
                load_isunit2(ptr2, ptr3, ptr4, ptr5, ptr6, bottom_stack2);
            }

            /* forwarding the instruction to the issue unit */
            if (disable_decode != 1)
            {
                (ptr7 + 4)->opcode_field =
                    (ptr3 + bottom_stack2)->opcode_field;
                (ptr7 + 4)->source_operand1 =
                    (ptr3 + bottom_stack2)->source_operand1;
                (ptr7 + 4)->source_operand2 =
                    (ptr3 + bottom_stack2)->source_operand2;
                (ptr7 + 4)->dest_operand =
                    (ptr3 + bottom_stack2)->dest_operand;

                /* rearranging the stack */
                for (i = 2; i <= 19; i++)
                {
                    (ptr3 + (i - 1))->opcode_field =
                        (ptr3 + i)->opcode_field;
                    (ptr3 + (i - 1))->source_operand1 =
                        (ptr3 + i)->source_operand1;
                    (ptr3 + (i - 1))->source_operand2 =
                        (ptr3 + i)->source_operand2;
                    (ptr3 + (i - 1))->dest_operand =
                        (ptr3 + i)->dest_operand;
                }
                (ptr6)->top_stack -= 1;
            }
        }
    }

    (ptr4)->decode_ptr += 1;
    if ((ptr4)->decode_ptr == 20)
    {
        (ptr4)->decode_ptr = 0;
    }
    return (*ptr7);
}

/******************************************/
/************* Issue unit *****************/
/******************************************/

struct5 issue_unit(ptr1,ptr2,ptr3,ptr4,ptr5,ptr6,ptr7)
struct0 *ptr1; /* pointers to the latches */
struct1 *ptr2;
struct6 *ptr3,*ptr4; /* decode stack pointers */
struct5 *ptr7;
{
    int i,j,k,l;
    int issue_pointer, dest_ptr, src1_ptr, src2_ptr;
    int temp1,temp2,temp3,temp4,raw_delay,waw_delay,inst_delay;
    issue_pointer = (ptr2)->issue_ptr;

    /* issue logic for PIC stream */
    if((current_queue == 1) && (disable_issue != 1))
    {
        temp1 = (ptr2+issue_pointer)->count_units[(ptr1+3)->dest_operand];
        temp2 = (ptr2+issue_pointer)->count_units[(ptr1+3)->source_operand1];
        temp3 = (ptr2+issue_pointer)->count_units[(ptr1+3)->source_operand2];
        temp4 = (ptr2+issue_pointer)->exec_time;

        /* computing RAW hazards */
        if ((temp2 == 0) && (temp3 == 0))
        {
            raw_delay = 0;
        }
        if ((temp2 > 0) && (temp3 == 0))
        {
            raw_delay = temp2;
        }
        if ((temp2 == 0) && (temp3 > 0))
        {
            raw_delay = temp3;
        }
        if ((temp2 > 0) && (temp3 > 0))
        {
            if (temp2 > temp3)
            {
                raw_delay = temp2;
            }
            else
            {
                raw_delay = temp3;
            }
        }

        /* checking for WAW hazards */
        if (temp1 == 0)
        {
            waw_delay = raw_delay + 1;
        }
        else if ((temp1 != 0) && (temp1 <= (raw_delay+temp4)))
        {
            waw_delay = temp1 + 2;
        }
        else
        {
            waw_delay = temp1 - temp4 + 3;
        }

        /* computing total delay */
        if (raw_delay > waw_delay)
        {
            inst_delay = raw_delay + 1;
        }
        else
        {
            inst_delay = waw_delay + 1;
        }

        /* updating the counter associated with the sink register */
        (ptr2+issue_pointer)->count_units[(ptr1+3)->dest_operand] =
            inst_delay + temp4 - 1;

        issue_pointer += 1;
        (ptr2)->issue_ptr = issue_pointer;
        if((ptr2)->issue_ptr == 20)
        {
            (ptr2)->issue_ptr = 0;
        }
    }

    /* issue logic for EAC stream */
    if((current_queue == 2) && (disable_issue != 1))
    {
        temp1 = (ptr2+issue_pointer)->count_units[(ptr1+4)->dest_operand];
        temp2 = (ptr2+issue_pointer)->count_units[(ptr1+4)->source_operand1];
        temp3 = (ptr2+issue_pointer)->count_units[(ptr1+4)->source_operand2];
        temp4 = (ptr2+issue_pointer)->exec_time;

        /* computing RAW hazards */
        if ((temp2 == 0) && (temp3 == 0))
        {
            raw_delay = 0;
        }
        if ((temp2 > 0) && (temp3 == 0))
        {
            raw_delay = temp2;
        }
        if ((temp2 == 0) && (temp3 > 0))
        {
            raw_delay = temp3;
        }
        if ((temp2 > 0) && (temp3 > 0))
        {
            if(temp2 > temp3)
            {
                raw_delay = temp2;
            }
            else
            {
                raw_delay = temp3;
            }
        }

        /* checking for WAW hazards */
        if (temp1 == 0)
        {
            waw_delay = raw_delay + 1;
        }
        else if ((temp1 != 0) && (temp1 <= (raw_delay+temp4)))
        {
            waw_delay = temp1 + 2;
        }
        else
        {
            waw_delay = temp1 - temp4 + 3;
        }

        /* computing total delay */
        if (raw_delay > waw_delay)
        {
            inst_delay = raw_delay + 1;
        }
        else
        {
            inst_delay = waw_delay + 1;
        }

        /* updating the counter associated with the sink register */
        (ptr2+issue_pointer)->count_units[(ptr1+4)->dest_operand] =
            inst_delay + temp4 - 1;

        issue_pointer += 1;
        (ptr2)->issue_ptr = issue_pointer;
        if ((ptr2)->issue_ptr == 20)
        {
            (ptr2)->issue_ptr = 0;
        }
    }
    return (*ptr7);
}

/******************************************/
/******* Function Initializations *********/
/******************************************/

void initialize(num1,num2)
struct7 *num1;
struct8 *num2;
{
    int i,j,k,l;
    for(i=1;i<=40;i++)
    {
        (num1+i)->location = i;
        (num1+i)->func = 0;
        (num2+i)->destination = i;
    }
}

/******************************************/
/***** Function Re-Initializations ********/
/******************************************/

struct7 reinit(num1)
struct7 *num1;
{
    struct7 *temp;
    int i,j,k,l;
    temp = num1;
    l = 1;
    for(i=1;i<=40;i++)
    {
        if ((num1+i)->func == 5)
        {
            (num1+i)->location = l;
            l = l + 1;
        }
        else if((num1+i)->func == 4)
        {
            (num1+i)->location = l;
        }
    }
    num1 = temp;
    return (*num1);
}

/******************************************/
/*********** Function stage 1 *************/
/******************************************/

struct2 stage_one(number_one,number_two,num_three,num_pass1,num_pass2)
struct1 *number_one,*number_two;
struct2 *num_three;
int num_pass1,num_pass2;
{
    int i,j,k,l;
    struct1 *p1,*p2;
    struct2 *p3;
    p1 = number_one;
    p2 = number_two;
    p3 = num_three;
    for (i=0;i<8;++i)
    {
        for (j=0;j<8;++j)
        {
            (num_three += i)->word[i+j] =
                ((number_one += num_pass1)->bits[j]) *
                ((number_two += num_pass2)->bits[i]);
            number_one = p1;
            number_two = p2;
            num_three = p3;
        }
    }
    printf(" \n");
    printf(" the partial products calculated in function stage1 are as follows \n");
    printf(" \n");
    printf(" \n");
    printf(" \n");
    for(i=0;i<8;++i)
    {
        printf("The partial product %d is\n",i);
        printf("\n");
        for(j=0;j<16;++j)
        {
            printf(" %d ",(num_three += i)->word[j]);
            num_three = p3;
        }
        printf("\n");
        printf("\n");
    }
    return (*num_three);
}

/******************************************/
/*********** Function Stage 2 *************/
/******************************************/

struct2 stage_two(num1,num2,num3,num4,num5,num6,num7,num8,num9,num10)
struct2 *num1,*num2,*num3,*num4,*num5,*num6,*num7,*num8,*num9,*num10;
{
    int i,j,k,l,nega,negb,negc;
    struct2 *p1,*p2,*p3,*p4,*p5,*p6,*p7,*p8,*p9,*p10;
    p1 = num1;
    p2 = num2;
    p3 = num3;
    p4 = num4;
    p5 = num5;
    p6 = num6;
    p7 = num7;
    p8 = num8;
    p9 = num9;
    p10 = num10;
    /* realization of csa unit number one */
    for(i=0;i<16;++i)
    {
        nega = 0;
        negb = 0;
        negc = 0;
        if ((num1)->word[i] == 0)
        {
            nega = 1;
        }
        if ((num2)->word[i] == 0)
        {
            negb = 1;
        }
        if ((num3)->word[i] == 0)
        {
            negc = 1;
        }
        num7->word[i] = ((((num1->word[i]*negb*negc)
            | (num2->word[i]*nega*negc)) | (num3->word[i]*nega*negb)) |
            (num1->word[i]*num2->word[i]*num3->word[i]));
        num8->word[i+1] = ((num1->word[i]*num2->word[i])
            | (num3->word[i]*num1->word[i])
            | (num2->word[i]*num3->word[i]));
    }
    /* realization of csa unit number two */
    for(i=0;i<16;++i)
    {
        nega = 0;
        negb = 0;
        negc = 0;
        if ((num4)->word[i] == 0)
        {
            nega = 1;
        }
        if ((num5)->word[i] == 0)
        {
            negb = 1;
        }
        if ((num6)->word[i] == 0)
        {
            negc = 1;
        }
        num9->word[i] = ((((num4->word[i]*negb*negc)
            | (num5->word[i]*nega*negc)) | (num6->word[i]*nega*negb)) |
            (num4->word[i]*num5->word[i]*num6->word[i]));
        num10->word[i+1] = ((num4->word[i]*num5->word[i])
            | (num6->word[i]*num4->word[i])
            | (num5->word[i]*num6->word[i]));
    }
    return(*num7,*num8,*num9,*num10);
}

/******************************************/
/*********** Function Stage 3 *************/
/******************************************/

struct2 stage_three(num1,num2,num3,num4,num5,num6,num7,num8,num9,num10)
struct2 *num1,*num2,*num3,*num4,*num5,*num6,*num7,*num8,*num9,*num10;
{
    int i,j,k,l,nega,negb,negc;
    struct2 *p1,*p2,*p3,*p4,*p5,*p6,*p7,*p8,*p9,*p10;
    p1 = num1;
    p2 = num2;
    p3 = num3;
    p4 = num4;
    p5 = num5;
    p6 = num6;
    p7 = num7;
    p8 = num8;
    p9 = num9;
    p10 = num10;
    /* realization of csa unit number three */
    for(i=0;i<16;++i)
    {
        nega = 0;
        negb = 0;
        negc = 0;
        if ((num1)->word[i] == 0)
        {
            nega = 1;
        }
        if ((num2)->word[i] == 0)
        {
            negb = 1;
        }
        if ((num3)->word[i] == 0)
        {
            negc = 1;
        }
        num7->word[i] = ((((num1->word[i]*negb*negc)
            | (num2->word[i]*nega*negc)) | (num3->word[i]*nega*negb)) |
            (num1->word[i]*num2->word[i]*num3->word[i]));
        num8->word[i+1] = ((num1->word[i]*num2->word[i])
            | (num3->word[i]*num1->word[i])
            | (num2->word[i]*num3->word[i]));
    }
    /* realization of csa unit number four */
    for(i=0;i<16;++i)
    {
        nega = 0;
        negb = 0;
        negc = 0;
        if ((num4)->word[i] == 0)
        {
            nega = 1;
        }
        if ((num5)->word[i] == 0)
        {
            negb = 1;
        }
        if ((num6)->word[i] == 0)
        {
            negc = 1;
        }
        num9->word[i] = ((((num4->word[i]*negb*negc)
            | (num5->word[i]*nega*negc)) | (num6->word[i]*nega*negb)) |
            (num4->word[i]*num5->word[i]*num6->word[i]));
        num10->word[i+1] = ((num4->word[i]*num5->word[i])
            | (num6->word[i]*num4->word[i])
            | (num5->word[i]*num6->word[i]));
    }
    return(*num7,*num8,*num9,*num10);
}

/******************************************/
/*********** Function Stage 4 *************/
/******************************************/

struct2 stage_four(num1,num2,num3,num4,num5)
struct2 *num1,*num2,*num3,*num4,*num5;
{
    int i,j,k,l,nega,negb,negc;
    struct2 *p1,*p2,*p3,*p4,*p5;
    p1 = num1;
    p2 = num2;
    p3 = num3;
    p4 = num4;
    p5 = num5;
    /* realization of csa unit number five */
    for(i=0;i<16;++i)
    {
        nega = 0;
        negb = 0;
        negc = 0;
        if ((num1)->word[i] == 0)
        {
            nega = 1;
        }
        if ((num2)->word[i] == 0)
        {
            negb = 1;
        }
        if ((num3)->word[i] == 0)
        {
            negc = 1;
        }
        num4->word[i] = ((((num1->word[i]*negb*negc)
            | (num2->word[i]*nega*negc)) | (num3->word[i]*nega*negb)) |
            (num1->word[i]*num2->word[i]*num3->word[i]));
        num5->word[i+1] = ((num1->word[i]*num2->word[i])
            | (num3->word[i]*num1->word[i])
            | (num2->word[i]*num3->word[i]));
    }
    return(*num4,*num5);
}

/******************************************/
/*********** Function Stage 5 *************/
/******************************************/

struct2 stage_five(num1,num2,num3,num4,num5)
struct2 *num1,*num2,*num3,*num4,*num5;
{
    int i,j,k,l,nega,negb,negc;
    struct2 *p1,*p2,*p3,*p4,*p5;
    p1 = num1;
    p2 = num2;
    p3 = num3;
    p4 = num4;
    p5 = num5;
    /* realization of csa unit number six */
    for(i=0;i<16;++i)
    {
        nega = 0;
        negb = 0;
        negc = 0;
        if ((num1)->word[i] == 0)
        {
            nega = 1;
        }
        if ((num2)->word[i] == 0)
        {
            negb = 1;
        }
        if ((num3)->word[i] == 0)
        {
            negc = 1;
        }
        num4->word[i] = ((((num1->word[i]*negb*negc)
            | (num2->word[i]*nega*negc)) | (num3->word[i]*nega*negb)) |
            (num1->word[i]*num2->word[i]*num3->word[i]));
        num5->word[i+1] = ((num1->word[i]*num2->word[i])
            | (num3->word[i]*num1->word[i])
            | (num2->word[i]*num3->word[i]));
    }
    return(*num4,*num5);
}

/******************************************/
/*********** Function Stage 6 *************/
/******************************************/

struct2 stage_six(num1,num2,num3,num4)
struct2 *num1,*num2,*num3;
struct1 *num4;
{
    int i,j,k,l,nega,negb,negc,carry[17];
    struct2 *p1,*p2,*p3;
    struct1 *p4;
    p1 = num1;
    p2 = num2;
    p3 = num3;
    p4 = num4;
    carry[0] = 0;
    /* here the distinction is being made between add & sub and the rest */
    if((addition == 1) || (subtraction == 1))
    {
        if(addition == 1)
        {
            printf(" addition is one \n");
            for(i=0;i<=7;i++)
            {
                (num1)->word[7-i] = (p4+1)->bits[i];
                (num2)->word[7-i] = (p4+2)->bits[i];
            }
            addition = 0;
            for(j=8;j<=16;j++)
            {
                (num1)->word[j] = 0;
                (num2)->word[j] = 0;
            }
        }
        if(subtraction == 1)
        {
            printf(" subtraction is one \n");
            for(i=0;i<=7;i++)
            {
                (num1)->word[7-i] = (p4+1)->bits[i];
                /* inverting the operand */
                if ((p4+2)->bits[i] == 1)
                {
                    (p4+2)->bits[i] = 0;
                }
                else
                {
                    (p4+2)->bits[i] = 1;
                }
                (num2)->word[7-i] = (p4+2)->bits[i];
            }
            carry[0] = 1;
            subtraction = 0;
            for(j=8;j<=16;j++)
            {
                (num1)->word[j] = 0;
                (num2)->word[j] = 1;
            }
        }
    }
    printf("\n");
    printf("\n");
    printf(" the following is the entered numbers\n");
    printf("\n");
    printf("\n");
    printf(" the value of num1 loaded is (16 - 0)\n");
    for(j=0;j<=16;j++)
    {
        printf(" %d ",num1->word[16-j]);
    }
    printf("\n");
    printf("\n");
    printf(" the value of num2 loaded is (16 - 0)\n");
    for(j=0;j<=16;j++)
    {
        printf(" %d ",num2->word[16-j]);
    }
    printf("\n");
    printf("\n");

    /* realization of carry propagation adder */
    for(i=0;i<16;++i)
    {
        nega = 0;
        negb = 0;
        negc = 0;
        if ((num1)->word[i] == 0)
        {
            nega = 1;
        }
        if ((num2)->word[i] == 0)
        {
            negb = 1;
        }
        if (carry[i] == 0)
        {
            negc = 1;
        }
        num3->word[i] = ((((num1->word[i]*negb*negc)
            | (num2->word[i]*nega*negc)) | (carry[i]*nega*negb)) |
            (num1->word[i]*num2->word[i]*carry[i]));
        carry[i+1] = ((num1->word[i]*num2->word[i])
            | (carry[i]*num1->word[i]) | (num2->word[i]*carry[i]));
    }
    return (*num3);
}

/******************************************/
/********* Function Delay One *************/
/******************************************/

struct2 delay_one(num1,num2,num3,num4)
struct2 *num1,*num2,*num3,*num4;
{
    int i,j,k;
    struct2 *p1,*p2,*p3,*p4;
    p1 = num1;
    p2 = num2;
    p3 = num3;
    p4 = num4;
    for (i=0;i<17;i++)
    {
        num3->word[i] = num1->word[i];
        num4->word[i] = num2->word[i];
    }
    return (*num3,*num4);
}

/******************************************/
/********* Function Delay Two *************/
/******************************************/

struct2 delay_two(num1,num2)
struct2 *num1,*num2;
{
    int i,j,k;
    struct2 *p1,*p2;
    p1 = num1;
    p2 = num2;
    for (i=0;i<17;i++)
    {
        num2->word[i] = num1->word[i];
    }
    return (*num2);
}

/* starting of the pipeline */
/* entering the values of the arguments */

void pipeline()
{
    struct2 *pone,*ptwo,*pthree,*pfour,*pfive,*psix,*pseven,*peight;
    struct2 *pnine,*pten;
    int one,two,i,j,k,l,m,v;
    printf(" \n");
    printf(" \n");
    /*printf(" enter the argument one from bit 8 to 0 \n");
    scanf("%d %d %d %d %d %d %d %d %d",
        &argument1[0].bits[8],&argument1[0].bits[7],
        &argument1[0].bits[6],&argument1[0].bits[5],
        &argument1[0].bits[4],&argument1[0].bits[3],
        &argument1[0].bits[2],&argument1[0].bits[1],
        &argument1[0].bits[0]);
    printf(" \n");
    printf(" \n");
    printf(" enter the argument two from bit 7 to 0 \n");
    scanf("%d %d %d %d %d %d %d %d",
        &argument2[0].bits[7],&argument2[0].bits[6],
        &argument2[0].bits[5],&argument2[0].bits[4],
        &argument2[0].bits[3],&argument2[0].bits[2],
        &argument2[0].bits[1],&argument2[0].bits[0]);
    printf(" \n");*/
    printf(" the data is fed from mpreg + 3,4 \n");
    if(multiplication == 1)
    {
        multiplication = 0;
        for(j=0;j<=7;j++)
        {
            (arg1_pointer+0)->bits[7-j] = (mpreg_ptr + 3)->bits[j];
            (arg2_pointer+0)->bits[7-j] = (mpreg_ptr + 4)->bits[j];
        }
    }
    if ((division == 1) || (delta_flag == 1))
    {
        division = 0;
        delta_flag = 0;
        for(j=0;j<=7;j++)
        {
            (arg2_pointer+0)->bits[7-j] = (mpreg_ptr + 3)->bits[j];
            (arg1_pointer+0)->bits[8-j] = (mpreg_ptr + 4)->bits[j];
        }
    }
    if (init_key == 1)
    {
        init_key = 0;
        for(j=0;j<=8;j++)
        {
            (arg1_pointer+0)->bits[8-j] = 0;
            (arg2_pointer+0)->bits[8-j] = 0;
        }
        printf(" the arguments are initialised to zero in init_key\n");
    }
    printf(" \n");
    printf(" the argument one is printed below 8 - 0 bits in correct place\n");
    printf(" \n");
    printf(" \n");
    for(j=0;j<=8;j++)
    {
        printf(" %d ",(arg1_pointer+0)->bits[8-j]);
    }
    printf(" \n");
    printf(" \n");
    printf(" the argument two is printed below 8 - 0 bits in correct place\n");
    printf(" \n");
    printf(" \n");
    for(j=0;j<=8;j++)
    {
        printf(" %d ",(arg2_pointer+0)->bits[8-j]);
    }
    printf(" \n");
    printf(" \n");
    one = 0;
    two = 0;
    stage_one(arg1_pointer,arg2_pointer,par_pointer,one,two);
    printf(" \n");
    printf(" \n");
    printf(" \n");
    printf(" the arguments are \n");
    printf(" \n");
    printf(" \n");
    printf(" argument1 is 8 - 0\n");
    printf(" \n");
    for (i=0;i<=8;++i)
    {
        printf(" %d ",argument1[0].bits[8-i]);
    }
    printf(" \n");
    printf(" \n");
    printf(" argument2 is 8 - 0\n");
    printf(" \n");
    for (i=0;i<=8;++i)
    {
        printf(" %d ",argument2[0].bits[8-i]);
    }
    printf(" \n");
    printf(" \n");
    printf(" the partial products calculated after function stage1 are as follows \n");
    printf(" \n");
    printf(" \n");
    printf(" \n");
    for(i=0;i<8;++i)
    {
        printf("The partial product %d is\n",i);
        printf("\n");
        for(j=0;j<16;++j)
        {
            printf(" %d ",(par_pointer += i)->word[j]);
            par_pointer = copy_three;
        }
        printf("\n");
        printf("\n");
    }

    /* stage two starts now */
    par_pointer = copy_three;
    lat_pointer = copy_four;
    trans_pointer = copy_five;
    delay_ptr = copy_six;
    pone = trans_pointer + 0;
    ptwo = trans_pointer + 1;
    pthree = trans_pointer + 2;
    pfour = trans_pointer + 3;
    pfive = trans_pointer + 4;
    psix = trans_pointer + 5;
    pseven = lat_pointer + 0;
    peight = lat_pointer + 1;
    pnine = lat_pointer + 2;
    pten = lat_pointer + 3;
    stage_two(pone,ptwo,pthree,pfour,pfive,psix,pseven,peight,pnine,pten);
    pone = delay_ptr + 3;
    ptwo = delay_ptr + 4;
    pthree = delay_ptr + 0;
    pfour = delay_ptr + 1;
    delay_one(pone,ptwo,pthree,pfour);
    printf(" \n");
    printf(" \n");
    printf(" \n");
    printf(" THE SUM AND CARRY VECTORS OF STAGE TWO \n");
    printf(" \n");
    printf(" FIRST AND THIRD ARE SUM VECTORS S1 AND S2 \n");
    printf(" \n");
    printf(" SECOND AND FOURTH ARE CARRY VECTORS C1 AND C2 \n");
    printf(" \n");
    printf(" \n");
    printf(" \n");
    for(i=0;i<4;++i)
    {
        printf(" \n");
        for(j=0;j<16;++j)
        {
            printf(" %d ",(lat_pointer += i)->word[j]);
            lat_pointer = copy_four;
        }
        printf("\n");
        printf("\n");
    }

    /* stage three starts now */
    par_pointer = copy_three;
    lat_pointer = copy_four;
    trans_pointer = copy_five;
    delay_ptr = copy_six;
    pone = trans_pointer + 6;
    ptwo = trans_pointer + 7;
    pthree = trans_pointer + 8;
    pfour = trans_pointer + 9;
    pfive = trans_pointer + 10;
    psix = trans_pointer + 11;
    pseven = lat_pointer + 4;
    peight = lat_pointer + 5;
    pnine = lat_pointer + 6;
    pten = lat_pointer + 7;
    stage_three(pone,ptwo,pthree,pfour,pfive,psix,pseven,peight,pnine,pten);
    printf(" \n");
    printf(" \n");
    printf(" \n");
    printf(" THE SUM AND CARRY VECTORS OF STAGE THREE \n");
    printf(" \n");
    printf(" FIRST AND THIRD ARE SUM VECTORS S3 AND S4 \n");
    printf(" \n");
    printf(" SECOND AND FOURTH ARE CARRY VECTORS C3 AND C4 \n");
    printf(" \n");
    printf(" \n");
    printf(" \n");
    for (i=4;i<8;++i)
    {
        printf(" \n");
        for(j=0;j<16;++j)
        {
            printf(" %d ",(lat_pointer += i)->word[j]);
            lat_pointer = copy_four;
        }
        printf("\n");
        printf("\n");
    }

    /* stage four starts now */
    par_pointer = copy_three;
    lat_pointer = copy_four;
    trans_pointer = copy_five;
    delay_ptr = copy_six;
    pone = trans_pointer + 12;
    ptwo = trans_pointer + 13;
    pthree = trans_pointer + 14;
    pfour = lat_pointer + 8;
    pfive = lat_pointer + 9;
    stage_four(pone,ptwo,pthree,pfour,pfive);
    pone = delay_ptr + 5;
    ptwo = delay_ptr + 2;
    delay_two(pone,ptwo);
    printf(" \n");
    printf(" \n");
    printf(" THE SUM AND CARRY VECTORS OF STAGE FOUR \n");
    printf(" \n");
    printf(" FIRST IS THE SUM VECTOR S5 \n");
    printf(" \n");
    printf(" SECOND IS THE CARRY VECTOR C5 \n");
    printf(" \n");
    printf(" \n");
    printf(" \n");
    for(i=8;i<10;++i)
    {
        printf(" \n");
        printf(" the value of pointer is %d\n",lat_pointer += i);
        lat_pointer = copy_four;
        printf("\n");
        for(j=0;j<16;++j)
        {
            printf(" %d ",(lat_pointer += i)->word[j]);
            lat_pointer = copy_four;
        }
        printf("\n");
        printf("\n");
    }

    /* stage five starts now */
    par_pointer = copy_three;
    lat_pointer = copy_four;
    trans_pointer = copy_five;
    delay_ptr = copy_six;
    pone = trans_pointer + 15;
    ptwo = trans_pointer + 16;
    pthree = trans_pointer + 17;
    pfour = lat_pointer + 10;
    pfive = lat_pointer + 11;
    stage_five(pone,ptwo,pthree,pfour,pfive);
    printf(" \n");
    printf(" \n");
    printf(" \n");
    printf(" THE SUM AND CARRY VECTORS OF STAGE FIVE \n");
    printf(" \n");
    printf(" FIRST IS THE SUM VECTOR S6 \n");
    printf(" \n");
    printf(" SECOND IS THE CARRY VECTOR C6 \n");
    printf(" \n");
    printf(" \n");
    printf(" \n");
    for(i=10;i<12;++i)
    {
        printf(" \n");
        for(j=0;j<16;++j)
        {
            printf(" %d ",(lat_pointer += i)->word[j]);
            lat_pointer = copy_four;
        }
        printf("\n");
        printf("\n");
    }

    /* stage six starts now */
    par_pointer = copy_three;
    lat_pointer = copy_four;
    trans_pointer = copy_five;
    delay_ptr = copy_six;
    pone = trans_pointer + 18;
    ptwo = trans_pointer + 19;
    pthree = lat_pointer + 12;
    stage_six(pone,ptwo,pthree,mpreg_ptr);
    printf(" \n");
    printf(" \n");
    printf(" \n");
    printf(" THE PRODUCT OF STAGE SIX \n");
    printf(" \n");
    printf(" \n");
    for(j=0;j<16;++j)
    {
        printf(" %d ",(lat_pointer += 12)->word[(15-j)]);
        lat_pointer = copy_four;
    }
    printf(" \n");
    printf("\n");

    /* stage seven */
    par_pointer = copy_three;
    lat_pointer = copy_four;
    trans_pointer = copy_five;
    delay_ptr = copy_six;
    pone = trans_pointer + 20;
    ptwo = delay_ptr + 7;
    delay_two(pone,ptwo);
    printf(" \n");
    printf(" \n");
    printf(" \n");
    printf(" THE RESULT OF STAGE SEVEN \n");
    printf(" \n");
    printf(" \n");
    for(j=0;j<16;++j)
    {
        printf(" %d ",(delay_ptr += 7)->word[(15-j)]);
        delay_ptr = copy_six;
    }
    return;
}

/* this stage represents the off period of the clock cycle */

struct2 time_off()
{
    int one,two,i,j,k,l,m,v;
    for(v=0;v<17;v++)
    {
        (trans_pointer += 0)->word[v] = (par_pointer += 0)->word[v];
        trans_pointer = copy_five;
        par_pointer = copy_three;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 1)->word[v] = (par_pointer += 1)->word[v];
        trans_pointer = copy_five;
        par_pointer = copy_three;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 2)->word[v] = (par_pointer += 2)->word[v];
        trans_pointer = copy_five;
        par_pointer = copy_three;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 3)->word[v] = (par_pointer += 3)->word[v];
        trans_pointer = copy_five;
        par_pointer = copy_three;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 4)->word[v] = (par_pointer += 4)->word[v];
        trans_pointer = copy_five;
        par_pointer = copy_three;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 5)->word[v] = (par_pointer += 5)->word[v];
        trans_pointer = copy_five;
        par_pointer = copy_three;
    }
    for(v=0;v<17;v++)
    {
        (delay_ptr += 3)->word[v] = (par_pointer += 6)->word[v];
        delay_ptr = copy_six;
        par_pointer = copy_three;
    }
    for(v=0;v<17;v++)
    {
        (delay_ptr += 4)->word[v] = (par_pointer += 7)->word[v];
        delay_ptr = copy_six;
        par_pointer = copy_three;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 10)->word[v] = (delay_ptr += 0)->word[v];
        trans_pointer = copy_five;
        delay_ptr = copy_six;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 11)->word[v] = (delay_ptr += 1)->word[v];
        trans_pointer = copy_five;
        delay_ptr = copy_six;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 6)->word[v] = (lat_pointer += 0)->word[v];
        trans_pointer = copy_five;
        lat_pointer = copy_four;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 7)->word[v] = (lat_pointer += 1)->word[v];
        trans_pointer = copy_five;
        lat_pointer = copy_four;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 8)->word[v] = (lat_pointer += 2)->word[v];
        trans_pointer = copy_five;
        lat_pointer = copy_four;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 9)->word[v] = (lat_pointer += 3)->word[v];
        trans_pointer = copy_five;
        lat_pointer = copy_four;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 12)->word[v] = (lat_pointer += 4)->word[v];
        trans_pointer = copy_five;
        lat_pointer = copy_four;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 13)->word[v] = (lat_pointer += 5)->word[v];
        trans_pointer = copy_five;
        lat_pointer = copy_four;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 14)->word[v] = (lat_pointer += 6)->word[v];
        trans_pointer = copy_five;
        lat_pointer = copy_four;
    }
    for(v=0;v<17;v++)
    {
        (delay_ptr += 5)->word[v] = (lat_pointer += 7)->word[v];
        delay_ptr = copy_six;
        lat_pointer = copy_four;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 15)->word[v] = (delay_ptr += 2)->word[v];
        trans_pointer = copy_five;
        delay_ptr = copy_six;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 16)->word[v] = (lat_pointer += 9)->word[v];
        trans_pointer = copy_five;
        lat_pointer = copy_four;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 17)->word[v] = (lat_pointer += 8)->word[v];
        trans_pointer = copy_five;
        lat_pointer = copy_four;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 18)->word[v] = (lat_pointer += 10)->word[v];
        trans_pointer = copy_five;
        lat_pointer = copy_four;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 19)->word[v] = (lat_pointer += 11)->word[v];
        trans_pointer = copy_five;
        lat_pointer = copy_four;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 20)->word[v] = (lat_pointer += 12)->word[v];
        trans_pointer = copy_five;
        lat_pointer = copy_four;
    }
    for(v=0;v<17;v++)
    {
        (trans_pointer += 21)->word[v] = (delay_ptr += 7)->word[v];
        trans_pointer = copy_five;
        delay_ptr = copy_six;
    }
    return;
}
/******************************************/
/********* Function cal delta *************/
/******************************************/

struct7 cal_delta(b1,b2,b3,b4)
struct7 *b1;
int *b2[20],b3,b4; /* b3 = ref_num2 , b4 = ref_num1 */
{
    struct7 *temp1;
    int *temp2[8],temp3[9],temp5;
    int i,j,k,l,carry;
    /*printf(" entered cal delta\n");*/
    /* inverting of the passed argument */
    for(i=1;i<=8;i++)
    {
        if (*(b2[b3]+i) == 1)
        {
            temp3[i] = 0;
        }
        else
        {
            temp3[i] = 1;
        }
    }
    /*printf("the inverted value in cal_delta \n");
    for (i=1;i<=8;i++)
    {
        printf(" %d ",temp3[i]);
    }
    printf("\n");*/
    /* adding the one to form delta */
    carry = 1;
    i = 1;
    while(carry == 1)
    {
        if (temp3[9-i] == 1)
        {
            temp3[9-i] = 0;
            carry = 1;
        }
        else
        {
            temp3[9-i] = 1;
            carry = 0;
        }
        i++;
    }
    /*printf("the converted value in cal_delta after adding one\n");
    for (i=1;i<=8;i++)
    {
        printf(" %d ",temp3[i]);
    }
    printf("\n");*/
    /* loading of data into delta */
    temp5 = b4 + 1;
    for(i=1;i<=8;i++)
    {
        (b1 + temp5)->num_one[i] = temp3[i];
        (b1 + temp5)->num_two[i] = temp3[i];
        (b1 + b4)->num_two[i] = temp3[i];
    }
    (b1 + temp5)->func = 5;
    (b1 + b4)->num_two[0] = 1;
    return (*b1);
}

/******************************************/
/******* Function subtract load ***********/
/******************************************/

struct7 subtract_load(c1,c2,c3,c4,c5,c6)
struct7 *c1;
int *c2,*c3[20],*c4[20],c5,c6; /* c5 = ref_num2 , c6 = ref_num1 */
{
    struct7 *temp1;
    int *temp2,temp3[9],temp4[9];
    int i,j,k,l,nega,negb,negc,carray,carry[10];
    /*printf("entered subtract load \n");*/
    /* the process below finds out the twos complement of D */
    /* inverting of the passed argument */
    for (i=1;i<=8;i++)
    {
        if (*(c4[c5]+i) == 1)
        {
            temp3[i] = 0;
        }
        else
        {
            temp3[i] = 1;
        }
    }
    /* adding the one to form delta */
    carray = 1;
    i = 1;
    while(carray == 1)
    {
        if (temp3[9-i] == 1)
        {
            temp3[9-i] = 0;
            carray = 1;
        }
        else
        {
            temp3[9-i] = 1;
            carray = 0;
        }
        i++;
    }
    /* the twos complement is calculated */
    /*printf(" the twos complement of D \n");
    for (i=1;i<=8;i++)
    {
        printf(" the value of i = %d\n",i);
        printf(" %d\n",temp3[i]);
    }
    printf("\n");*/
    /* the below segment adds N and D's 2's complement */
    carry[0] = 0;
    carry[1] = 0;
    for (i=1;i<=8;++i)
    {
        /*printf(" the value of iteration \n");*/
        nega = 0;
        negb = 0;
        negc = 0;
        /*printf(" the value of *(c3[c5]+(9-%d)) = %d\n",i,*(c3[c5]+(9-i)));*/
        if (*(c3[c5]+(9-i)) == 0)
        {
            nega = 1;
        }
        /*printf(" the value of temp3[9-%d] \n",i);*/
        if (temp3[9-i] == 0)
        {
            negb = 1;
        }
        if (carry[i] == 0)
        {
            negc = 1;
        }
        temp4[i] = ((((*(c3[c5]+(9-i))*negb*negc)
            | (temp3[9-i]*nega*negc)) | (carry[i]*nega*negb)) |
            (*(c3[c5]+(9-i))*temp3[9-i]*carry[i]));
        /*printf(" the partial product of temp 4 with i = %d\n",i);
        printf(" %d \n",temp4[i]);*/
        carry[i+1] = ((*(c3[c5]+(9-i))*temp3[9-i])
            | (carry[i]*(*(c3[c5]+(9-i)))) | (temp3[9-i]*carry[i]));
    }
    /*printf(" the value of N - D \n");
    for (i=1;i<=8;i++)
    {
        printf(" %d ",temp4[i]);
    }
    printf("\n");*/
    /* loading of N - D into num_one */
    for(i=1;i<=8;i++)
    {
        (c1 + c6)->num_one[9-i] = temp4[i];
    }
    return (*c1);
}

/******************************************/
/****** Function compare & load ***********/
/******************************************/

struct7 compare_load(a1,a2,a3,a4,a5,a6)
struct7 *a1;
int *a2[20],*a3[20],a4,a5,*a6; /* a4 -> ref_num2 , a5 -> ref_num1 */
{
    /* a2 -> arg1; a3 -> arg2; a6 -> func */
    struct7 *temp1;
    int *temp2,*temp3,temp4,*temp5,i,j,k,l,flag_one,flag_two;
    /* the comparison of the two arguments */
    flag_one = 0;
    flag_two = 0;
    for (i=1;i<=8;i++)
    {
        if (*(a2[a4]+i) > *(a3[a4]+i))
        {
            flag_one = 1;
            i = 8;
        }
        if (*(a2[a4]+i) < *(a3[a4]+i))
        {
            flag_two = 1;
            i = 8;
        }
    }
    if(flag_two == 1)
    {
        for (i=1;i<=8;i++)
        {
            (a1+a5)->num_one[i] = *(a2[a4]+i); /* arg1 loaded into num1 */
        }
        /*printf(" the value of d is greater than n\n");*/
        (a1+a5)->func = *(a6+a4); /* function value is loaded */
        /*printf(" the value of opcode loaded in compare&load is %d\n",(a1+a5)->func);*/
        cal_delta(a1,a3,a4,a5); /* loading of delta into num2 and creating delta line */
        (a1 + a5)->over_flow = 0;
    }
    else
    {
        /*printf(" the value of n is greater than d\n");*/
        (a1+a5)->func = *(a6+a4); /* function value is loaded */
        /*printf(" the value of opcode loaded in compare&load is %d\n",(a1+a5)->func);*/
        cal_delta(a1,a3,a4,a5); /* loading of delta into num2 and creating delta line */
        (a1 + a5)->over_flow = 1;
    }
    return (*a1);
}

/******************************************/
/******* Function pre processor ***********/
/******************************************/

struct7 pre_proc(num1,num2,num3,num4)
struct7 *num1;
int *num2,*num3[20],*num4[20]; /* num2 -> function; num3 -> argument1; */
{
    /* num4 -> argument2 */
    /* intialisations */
    struct7 *temp1,*temp2;
    int *temp3,*temp4,*temp5;
    int temp_flag1,temp_flag2,temp_flag3;
    int i,j,k,l,ref_num1,ref_num2,ref_num3;
    temp1 = num1;
    /* the testing of the type of function */
    ref_num1 = 1; /* indexing pre processor structure */
    ref_num2 = 1; /* indexing the data array */
    ref_num3 = 0;
    for(ref_num2=1;ref_num2<=stk_ptr;ref_num2++)
    {
        /*printf("the value of the condition is %d\n",*(num2 + ref_num2));
        printf("the present value of ref_num1 %d \n",ref_num1);*/
        switch (*(num2 + ref_num2))
        {
        case 1:
            /*printf(" case number one \n");*/
            (num1+ ref_num1)->func = *(num2 + ref_num2);
            for(i=1;i<=8;i++)
            {
                (num1+ ref_num1)->num_one[i] = *(num3[ref_num2]+i);
                (num1+ ref_num1)->num_two[i] = *(num4[ref_num2]+i);
            }
            (num1+ ref_num1)->over_flow = 0;
            (num1+ ref_num1)->weight = 0;
            /* printing of the input stack */
            /*for(i=1;i<=ref_num1;i++)
            {
                printf(" the input stack is printed below with ref_num1 as %d \n",ref_num1);
                printf("the opcode is %d\n",(num1 + i)->func);
                printf("the value of argument one is as follows \n");
                for(j=1;j<=8;j++)
                {
                    printf(" %d ",(num1 + i)->num_one[j]);
                }
                printf("\n");
                printf("the value of argument two is as follows \n");
                for(j=1;j<=8;j++)
                {
                    printf(" %d ",(num1 + i)->num_two[j]);
                }
                printf("\n");
            }*/
            ref_num1 = ref_num1 + 1;
            /*printf(" reached break at case one \n");*/
            break;

        case 2:
            /*printf(" case number two \n");*/
            (num1+ ref_num1)->func = *(num2 + ref_num2);
            for (i=1;i<=8;i++)
            {
                (num1+ ref_num1)->num_one[i] = *(num3[ref_num2]+i);
                (num1+ ref_num1)->num_two[i] = *(num4[ref_num2]+i);
            }
            (num1+ ref_num1)->over_flow = 0;
            (num1+ ref_num1)->weight = 0;
            /*printf("reached the printing stage in case two\n");
            printf(" the present value of ref_num1 in case 2 %d \n",ref_num1);*/
            /* printing of the input stack */
            i=0;
            /*for(i=1;i<=ref_num1;++i)
            {
                printf(" the input stack is printed below with ref_num1 as %d \n",ref_num1);
                printf("the opcode is %d\n",(num1 + i)->func);
                printf("the value of argument one is as follows \n");
                for(j=1;j<=8;j++)
                {
                    printf(" %d ",(num1 + i)->num_one[j]);
                }
                printf("\n");
                printf("the value of argument two is as follows \n");
                for(j=1;j<=8;j++)
                {
                    printf(" %d ",(num1 + i)->num_two[j]);
                }
                printf("\n");
            }*/
            ref_num1 = ref_num1 + 1;
            break;

        case 3:
            /*printf(" case number three \n");*/
            (num1+ ref_num1)->func = *(num2 + ref_num2);
            for(i=1;i<=8;i++)
            {
                (num1+ ref_num1)->num_one[i] = *(num3[ref_num2]+i);
                (num1+ ref_num1)->num_two[i] = *(num4[ref_num2]+i);
            }
            (num1+ ref_num1)->over_flow = 0;
            (num1+ ref_num1)->weight = 0;
            /* printing of the input stack */
            /*for(i=1;i<=ref_num1;i++)
            {
                printf(" the input stack is printed below with ref_num1 as %d \n",ref_num1);
                printf("the opcode is %d\n",(num1 + i)->func);
                printf("the value of argument one is as follows \n");
                for(j=1;j<=8;j++)
                {
                    printf(" %d ",(num1 + i)->num_one[j]);
                }
                printf("\n");
                printf("the value of argument two is as follows \n");
                for(j=1;j<=8;j++)
                {
                    printf(" %d ",(num1 + i)->num_two[j]);
                }
                printf("\n");
            }*/
            ref_num1 = ref_num1 + 1;
            break;

        case 4:
            /* the division case */
            /*printf(" case number four \n");
            printf(" entering compare load \n");*/
            compare_load(num1,num3,num4,ref_num2,ref_num1,num2);
            if((num1 + ref_num1)->over_flow == 1)
            {
                subtract_load(num1,num2,num3,num4,ref_num2,ref_num1);
            }
            /* printing of the input stack */
            /*for(i=1;i<=ref_num1;i++)
            {
                printf(" the input stack is printed below with ref_num1 as %d \n",ref_num1);
                printf("the opcode is %d\n",(num1 + i)->func);
                printf("the value of argument one is as follows \n");
                for(j=1;j<=8;j++)
                {
                    printf(" %d ",(num1 + i)->num_one[j]);
                }
                printf("\n");
                printf("the value of argument two is as follows \n");
                for(j=1;j<=8;j++)
                {
                    printf(" %d ",(num1 + i)->num_two[j]);
                }
                printf("\n");
            }*/
            ref_num1 = ref_num1 + 2;
            break;
        }
    }
    return (*num1);
}
void print_outstack(dum1)
struct8 *dum1;
{
    struct8 *temp1;
    int i,j,k,l;
    temp1 = dum1;
    for(i=0;i<=20;i++)
    {
        printf("\n");
        printf("\n");
        printf(" the original I.S number \n");
        printf(" %d \n",(temp1+i)->destination);
        printf("\n");
        printf("\n");
        printf(" The result of the instruction (16 - 0)\n");
        printf("\n");
        printf("\n");
        printf("\n");
        for(j=0;j<=16;j++)
        {
            printf(" %d ",(temp1+i)->result[16-j]);
        }
        printf("\n");
        printf("\n");
        printf("\n");
    }
}
void print_psstack(dum1)
struct9 *dum1;
{
    struct9 *temp1;
    int i,j,k,l;
    temp1 = dum1;
    for(i=0;i<=20;i++)
    {
        printf("\n");
        printf("\n");
        printf(" the tracking register number \n");
        printf(" %d \n",(temp1+i)->address);
        printf("\n");
        printf("\n");
        printf(" the function number \n");
        printf(" %d \n",(temp1+i)->func);
        printf("\n");
        printf("\n");
        printf(" The result of the num_one (0 - 8)\n");
        printf("\n");
        printf("\n");
        printf("\n");
        for(j=0;j<=8;j++)
        {
            printf(" %d ",(temp1+i)->num_one[j]);
        }
        printf("\n");
        printf("\n");
        printf(" The result of the num_two (0 - 8)\n");
        printf("\n");
        printf("\n");
        for(j=0;j<=8;j++)
        {
            printf(" %d ",(temp1+i)->num_two[j]);
        }
        printf("\n");
        printf("\n");
        printf("\n");
        printf("\n");
    }
}

.........................
..........................................

/******* Function Output Check **********/


..........................................

..........................
struct8 output_check(num0,num1,num2,num3,num4,num5,num6,num7,num8,num9)
struct2 *num0;   /* pointer to trans_pointer */
struct9 *num1;   /* pointer to priority stack */
struct8 *num2;   /* pointer to output structure */
struct3 *num3;   /* pointer to div trac */
struct4 *num4;   /* pointer to mult trac */
struct5 *num5;   /* pointer to add trac */
struct6 *num6;   /* pointer to logg sheet */
struct5 *num7;   /* pointer to sub track */
struct3 *num8;   /* pointer to delta track */
struct1 *num9;   /* pointer to multi-purpose registers */
{
    struct9 *temp1;   /* pointer to priority registers */
    struct2 *temp0;   /* pointer to trans_pointer */
    struct8 *temp2;   /* pointer to output structure */
    struct3 *temp3;   /* pointer to div trac */
    struct4 *temp4;   /* pointer to mult trac */
    struct5 *temp5;   /* pointer to add trac */
    struct6 *temp6;   /* pointer to logg sheet */
    struct5 *temp7;   /* pointer to sub track */
    struct3 *temp8;   /* pointer to delta track */
    struct1 *temp9;   /* pointer to mpreg */
    int i, j, k, l, remainder, local_index, future_index, get_out;

    temp0 = num0;
    temp2 = num2;
    temp3 = num3;
    temp4 = num4;
    temp5 = num5;
    temp6 = num6;
    temp7 = num7;
    temp8 = num8;
    temp9 = num9;
    temp1 = num1;
    /* checking of add trac */
    if ((temp6+1)->logg_stat == 1)
    {
        printf(" add output_check is engaged \n");
        for (i = 1; i <= 9; i++)
        {
            if ((temp6+1)->logg[i] == 1)
            {
                if ((temp5+i)->st_track[1] == 1)
                {
                    printf(" the output of add is being loaded\n");
                    k = (temp5+i)->address;
                    for (j = 0; j <= 16; j++)
                    {
                        (temp2+k)->result[j] = (temp0+20)->word[j];
                    }
                    for (j = 0; j <= 6; j++)
                    {
                        (temp5+i)->st_track[j] = 0;
                    }
                    (temp6+1)->logg[i] = 0;
                }
            }
        }
    }
    printf(" THE OUTPUT STACK AFTER LOADING ADDITION\n");
    print_outstack(num2);

    /* checking of sub trac */
    if ((temp6+2)->logg_stat == 1)
    {
        printf(" sub output_check is engaged \n");
        for (i = 1; i <= 9; i++)
        {
            if ((temp6+2)->logg[i] == 1)
            {
                if ((temp7+i)->st_track[1] == 1)
                {
                    printf(" the output of sub is being loaded\n");
                    k = (temp7+i)->address;
                    for (j = 0; j <= 16; j++)
                    {
                        (temp2+k)->result[j] = (temp0+20)->word[j];
                    }
                    for (j = 0; j <= 6; j++)
                    {
                        (temp7+i)->st_track[j] = 0;
                    }
                    (temp6+2)->logg[i] = 0;
                }
            }
        }
    }
    printf(" THE OUTPUT STACK AFTER LOADING SUBTRACTION\n");
    print_outstack(num2);

    /* checking of mult trac */
    if ((temp6+3)->logg_stat == 1)
    {
        printf(" mult output_check is engaged \n");
        for (i = 1; i <= 9; i++)
        {
            if ((temp6+3)->logg[i] == 1)
            {
                if ((temp4+i)->st_track[6] == 1)
                {
                    printf(" the output of mult is being loaded\n");
                    k = (temp4+i)->address;
                    for (j = 0; j <= 16; j++)
                    {
                        (temp2+k)->result[j] = (temp0+20)->word[j];
                    }
                    for (j = 0; j <= 8; j++)
                    {
                        (temp4+i)->st_track[j] = 0;
                    }
                    (temp6+3)->logg[i] = 0;
                }
            }
        }
    }
    printf(" THE OUTPUT STACK AFTER LOADING MULTIPLICATION \n");
    print_outstack(num2);

    /* checking of div trac */
    if ((temp6+4)->logg_stat == 1)
    {
        printf(" div output_check is engaged \n");
        for (i = 1; i <= 9; i++)
        {
            if ((temp6+4)->logg[i] == 1)
            {
                /* Division is present and the result */
                /* is ready to be iterated or sent to */
                /* priority stack. Checking for delta */
                /* or for the iteration number in the div registers */
                /* calculating the remainder */
                get_out = 0;
                remainder = 0;
                for (j = 8; j <= 15; j++)
                {
                    remainder = remainder + (temp0+20)->word[j];
                }
                if (((temp3+i)->itr_track == 4) || (remainder == 0))
                {
                    printf(" the output of div is being loaded\n");
                    (temp3+i)->itr_track = 0;
                    k = (temp3+i)->address;
                    for (j = 0; j <= 16; j++)
                    {
                        (temp2+k)->result[j] = (temp0+21)->word[j];
                    }
                    for (j = 0; j <= 8; j++)
                    {
                        (temp3+i)->st_track[j] = 0;
                        (temp8+i)->st_track[j] = 0;
                    }
                    (temp6+4)->logg[i] = 0;
                    (temp6+5)->logg[i] = 0;
                    (temp3+i)->itr_track = 0;
                    (temp8+i)->itr_track = 0;
                    get_out = 1;
                }
                if (get_out == 0)
                {
                    /* the iteration has to be carried out */
                    printf(" the data is going to be stored in P.S\n");
                    /* loading of data in priority stack */
                    local_index = (temp9 + 0)->bits[9];
                    future_index = local_index + 1;
                    /* loading of data */
                    for (j = 0; j <= 7; j++)
                    {
                        (temp1+local_index)->num_one[j] = (temp0+21)->word[15-j];
                        (temp1+local_index)->num_two[j+1] = (temp0+20)->word[15-j];
                        (temp1+future_index)->num_one[j] = (temp0+20)->word[15-j];
                        (temp1+future_index)->num_two[j+1] = (temp0+20)->word[15-j];
                    }
                    (temp9 + 0)->bits[9] = future_index + 1;
                    (temp1+local_index)->num_two[0] = 1;
                    /* setting of priority flag */
                    if ((temp9+0)->bits[9] == (temp9+0)->bits[0])
                    {
                        printf(" the priority flag is set to 1 \n");
                        (temp9+0)->bits[1] = 1;
                    }
                    /* Initialising the track register */
                    (temp1+local_index)->address = i;
                    (temp1+future_index)->address = i;
                    (temp1+local_index)->func = 4;
                    (temp1+future_index)->func = 5;
                    /* initialising the registers to zero */
                    for (j = 0; j <= 8; j++)
                    {
                        (temp3+i)->st_track[j] = 0;
                        (temp8+i)->st_track[j] = 0;
                    }
                    printf(" the priority stack is printed below \n");
                    print_psstack(temp1);
                }
            }
        }
    }
    printf(" THE OUTPUT STACK AFTER LOADING DIVISION \n");
    print_outstack(num2);
    return (*num2);
}

/******************************************/
/*******  Function Shift Track  ***********/
/******************************************/

void shift_track(num1,num2,num3,num4,num5,num6)
struct3 *num1, *num6;   /* pointer to div trac and delta track */
struct4 *num2;          /* pointer to mult trac */
struct5 *num3, *num5;   /* pointer to add trac and sub track */
struct6 *num4;          /* pointer to logg sheet */
{
    struct3 *temp1;   /* pointer to div track */
    struct4 *temp2;   /* pointer to mult track */
    struct5 *temp3;   /* pointer to add track */
    struct6 *temp4;   /* pointer to logg sheet */
    struct5 *temp5;   /* pointer to sub track */
    struct3 *temp6;   /* pointer to delta track */
    int i, j, k, l;

    temp1 = num1;
    temp2 = num2;
    temp3 = num3;
    temp4 = num4;
    temp5 = num5;
    temp6 = num6;

    /* shifting of add trac */
    if ((temp4+1)->logg_stat == 1)
    {
        printf(" add track is engaged to be shifted\n");
        for (i = 1; i <= 9; i++)
        {
            if ((temp4+1)->logg[i] == 1)
            {
                for (j = 1; j <= 8; j++)
                {
                    if ((temp3+i)->st_track[j] == 1)
                    {
                        (temp3+i)->st_track[j+1] = 1;
                        (temp3+i)->st_track[j] = 0;
                        j = 9;
                    }
                }
            }
        }
    }

    /* shifting of sub trac */
    if ((temp4+2)->logg_stat == 1)
    {
        printf(" sub track is engaged to be shifted\n");
        for (i = 1; i <= 9; i++)
        {
            if ((temp4+2)->logg[i] == 1)
            {
                for (j = 1; j <= 8; j++)
                {
                    if ((temp5+i)->st_track[j] == 1)
                    {
                        (temp5+i)->st_track[j+1] = 1;
                        (temp5+i)->st_track[j] = 0;
                        j = 9;
                    }
                }
            }
        }
    }

    /* shifting of mult trac */
    if ((temp4+3)->logg_stat == 1)
    {
        printf(" mult track is engaged to be shifted\n");
        for (i = 1; i <= 9; i++)
        {
            if ((temp4+3)->logg[i] == 1)
            {
                for (j = 1; j <= 8; j++)
                {
                    if ((temp2+i)->st_track[j] == 1)
                    {
                        (temp2+i)->st_track[j+1] = 1;
                        (temp2+i)->st_track[j] = 0;
                        j = 9;
                    }
                }
            }
        }
    }

    /* shifting of div trac */
    if ((temp4+4)->logg_stat == 1)
    {
        printf(" div track is engaged to be shifted\n");
        for (i = 1; i <= 9; i++)
        {
            if ((temp4+4)->logg[i] == 1)
            {
                for (j = 1; j <= 8; j++)
                {
                    if ((temp1+i)->st_track[j] == 1)
                    {
                        (temp1+i)->st_track[j+1] = 1;
                        (temp1+i)->st_track[j] = 0;
                        j = 9;
                    }
                }
            }
        }
    }

    /* shifting of delta trac */
    if ((temp4+5)->logg_stat == 1)
    {
        printf(" delta track is engaged to be shifted\n");
        for (i = 1; i <= 9; i++)
        {
            if ((temp4+5)->logg[i] == 1)
            {
                for (j = 1; j <= 8; j++)
                {
                    if ((temp6+i)->st_track[j] == 1)
                    {
                        (temp6+i)->st_track[j+1] = 1;
                        (temp6+i)->st_track[j] = 0;
                        j = 9;
                    }
                }
            }
        }
    }
}
void status_print1(num1,num2,num3)
struct1 *num1;
struct5 *num2;
struct6 *num3;
{
    struct1 *dum1;
    struct5 *dum2;
    struct6 *dum3;
    int i, j, k, l;

    dum1 = num1;
    dum2 = num2;
    dum3 = num3;
    printf("printing the pipeline register and flag register and tracking registers and status logg\n");
    printf("\n");
    printf("\n");
    printf(" the input registers ( 8 - 0 )\n");
    printf("\n");
    printf("\n");
    printf(" the input register mpreg + 1 \n");
    for (j = 0; j <= 7; j++)
    {
        printf(" %d ", (dum1+1)->bits[7-j]);
    }
    printf("\n");
    printf("\n");
    printf(" the input register mpreg + 2 \n");
    for (j = 0; j <= 7; j++)
    {
        printf(" %d ", (dum1+2)->bits[7-j]);
    }
    printf("\n");
    printf("\n");
    printf(" the flag register \n");
    printf("\n");
    printf("\n");
    printf(" the flag register is mpreg + 0 \n");
    for (i = 0; i <= 10; i++)
    {
        printf(" %d ", (dum1+0)->bits[10-i]);
    }
    printf("\n");
    printf("\n");
    printf(" the logging register for add\n");
    for (i = 1; i <= 9; i++)
    {
        printf(" %d ", (dum3+1)->logg[9-i]);
    }
    printf("\n");
    printf("\n");
    printf(" the tracking registers \n");
    for (i = 1; i <= 9; i++)
    {
        if ((dum3+1)->logg[i] == 1)
        {
            printf(" the tracking register number is %d and the value of the address is %d \n", i, (dum2+i)->address);
            printf("\n");
            for (j = 1; j <= 8; j++)
            {
                printf(" %d ", (dum2+i)->st_track[8-j]);
            }
            printf("\n");
            printf("\n");
        }
    }
}

void status_print2(num1,num2,num3)
struct1 *num1;
struct5 *num2;
struct6 *num3;
{
    struct1 *dum1;
    struct5 *dum2;
    struct6 *dum3;
    int i, j, k, l;

    dum1 = num1;
    dum2 = num2;
    dum3 = num3;
    printf("printing the pipeline register and flag register and tracking registers and status logg\n");
    printf("\n");
    printf("\n");
    printf(" the input registers ( 8 - 0 )\n");
    printf("\n");
    printf("\n");
    printf(" the input register mpreg + 1 \n");
    for (j = 0; j <= 7; j++)
    {
        printf(" %d ", (dum1+1)->bits[7-j]);
    }
    printf("\n");
    printf("\n");
    printf(" the input register mpreg + 2 \n");
    for (j = 0; j <= 7; j++)
    {
        printf(" %d ", (dum1+2)->bits[7-j]);
    }
    printf("\n");
    printf("\n");
    printf(" the flag register \n");
    printf("\n");
    printf("\n");
    printf(" the flag register is mpreg + 0 \n");
    for (i = 0; i <= 10; i++)
    {
        printf(" %d ", (dum1+0)->bits[10-i]);
    }
    printf("\n");
    printf("\n");
    printf(" the logging register for sub\n");
    for (i = 1; i <= 9; i++)
    {
        printf(" %d ", (dum3+2)->logg[9-i]);
    }
    printf("\n");
    printf("\n");
    printf(" the tracking registers \n");
    for (i = 1; i <= 9; i++)
    {
        if ((dum3+2)->logg[i] == 1)
        {
            printf(" the tracking register number is %d and the value of the address is %d \n", i, (dum2+i)->address);
            printf("\n");
            for (j = 0; j <= 8; j++)
            {
                printf(" %d ", (dum2+i)->st_track[8-j]);
            }
            printf("\n");
            printf("\n");
        }
    }
}

void status_print3(num1,num2,num3)
struct1 *num1;
struct4 *num2;
struct6 *num3;
{
    struct1 *dum1;
    struct4 *dum2;
    struct6 *dum3;
    int i, j, k, l;

    dum1 = num1;
    dum2 = num2;
    dum3 = num3;
    printf("printing the pipeline register and flag register and tracking registers and status logg\n");
    printf("\n");
    printf("\n");
    printf(" the input registers ( 8 - 0 )\n");
    printf("\n");
    printf("\n");
    printf(" the input register mpreg + 3 \n");
    for (j = 0; j <= 8; j++)
    {
        printf(" %d ", (dum1+3)->bits[8-j]);
    }
    printf("\n");
    printf("\n");
    printf(" the input register mpreg + 4 \n");
    for (j = 0; j <= 8; j++)
    {
        printf(" %d ", (dum1+4)->bits[8-j]);
    }
    printf("\n");
    printf("\n");
    printf(" the flag register \n");
    printf("\n");
    printf("\n");
    printf(" the flag register is mpreg + 0 \n");
    for (i = 0; i <= 10; i++)
    {
        printf(" %d ", (dum1+0)->bits[10-i]);
    }
    printf("\n");
    printf("\n");
    printf(" the logging register for mult\n");
    for (i = 1; i <= 9; i++)
    {
        printf(" %d ", (dum3+3)->logg[9-i]);
    }
    printf("\n");
    printf("\n");
    printf(" the tracking registers \n");
    for (i = 1; i <= 9; i++)
    {
        if ((dum3+3)->logg[i] == 1)
        {
            printf(" the tracking register number is %d and the value of the address is %d \n", i, (dum2+i)->address);
            printf("\n");
            for (j = 1; j <= 8; j++)
            {
                printf(" %d ", (dum2+i)->st_track[8-j]);
            }
            printf("\n");
            printf("\n");
        }
    }
}

void status_print4(num1,num2,num3)
struct1 *num1;
struct3 *num2;
struct6 *num3;
{
    struct1 *dum1;
    struct3 *dum2;
    struct6 *dum3;
    int i, j, k, l;

    dum1 = num1;
    dum2 = num2;
    dum3 = num3;
    printf("printing the pipeline register and flag register and tracking registers and status logg\n");
    printf("\n");
    printf("\n");
    printf(" the input registers ( 8 - 0 )\n");
    printf("\n");
    printf("\n");
    printf(" the input register mpreg + 3 \n");
    for (j = 0; j <= 8; j++)
    {
        printf(" %d ", (dum1+3)->bits[8-j]);
    }
    printf("\n");
    printf("\n");
    printf(" the input register mpreg + 4 \n");
    for (j = 0; j <= 8; j++)
    {
        printf(" %d ", (dum1+4)->bits[8-j]);
    }
    printf("\n");
    printf("\n");
    printf(" the flag register \n");
    printf("\n");
    printf("\n");
    printf(" the flag register is mpreg + 0 \n");
    for (i = 0; i <= 10; i++)
    {
        printf(" %d ", (dum1+0)->bits[10-i]);
    }
    printf("\n");
    printf("\n");
    printf(" the logging register for div\n");
    for (i = 1; i <= 9; i++)
    {
        printf(" %d ", (dum3+4)->logg[9-i]);
    }
    printf("\n");
    printf("\n");
    printf(" the tracking registers \n");
    for (i = 1; i <= 9; i++)
    {
        if ((dum3+4)->logg[i] == 1)
        {
            printf(" the tracking register number is %d and the value of the address is %d \n", i, (dum2+i)->address);
            printf("\n");
            for (j = 1; j <= 8; j++)
            {
                printf(" %d ", (dum2+i)->st_track[8-j]);
            }
            printf("\n");
            printf("\n");
        }
    }
}
/******************************************/
/*******  Function Load Pipeline  *********/
/******************************************/

/***********************************/
/***** 0. P.F indicator.    ********/
/***** 1. Priority Flag.    ********/
/***** 2. Stack Index.      ********/
/***** 3. CCM Pointer.      ********/
/***** 4. ADD Latency.      ********/
/***** 5. SUB Latency.      ********/
/***** 6. MULT Latency.     ********/
/***** 7. DIV Latency.      ********/
/***** 8. Priority Index.   ********/
/***** 9. Local Index.      ********/
/***********************************/
struct1 load_pipeline(num0,num1,num2,num3,num4,num5,num6,num7,num8,num9,num10,num11,num12)
struct1 *num0;    /* pointer to input registers */
struct0 *num1;    /* pointer to cross collision matrices */
struct7 *num2;    /* pointer to input structure */
struct3 *num3;    /* pointer to div trac */
struct4 *num4;    /* pointer to mult trac */
struct5 *num5;    /* pointer to add trac */
struct6 *num6;    /* pointer to logg sheet */
struct9 *num10;   /* pointer to priority structure */
struct5 *num11;   /* pointer to subtract trac */
struct3 *num12;   /* pointer to delta trac */
int num7, num8, num9;   /* registers */
{
    struct1 *temp0;    /* pointer to input registers */
    struct0 *temp1;    /* pointer to cross collision matrices */
    struct7 *temp2;    /* pointer to input structure */
    struct3 *temp3;    /* pointer to div trac */
    struct4 *temp4;    /* pointer to mult trac */
    struct5 *temp5;    /* pointer to add trac */
    struct6 *temp6;    /* pointer to logg sheet */
    struct9 *temp10;   /* pointer to priority structure */
    struct5 *temp11;   /* pointer to subtract trac */
    struct3 *temp12;   /* pointer to delta trac */
    int i, j, k, l, priority_flag, stack_index, matrix_index, look_ahead;
    int priority_index, additional_entry, divisional_entry, dis;

    init_key = 0;
    delta_flag = 0;
    addition = 0;
    subtraction = 0;
    multiplication = 0;
    division = 0;
    temp0 = num0;
    temp1 = num1;
    temp2 = num2;
    temp3 = num3;
    temp4 = num4;
    temp5 = num5;
    temp6 = num6;
    temp10 = num10;
    temp11 = num11;
    temp12 = num12;
    priority_flag = (num0+0)->bits[1];   /* loading the priority flag */
    stack_index = (num0+0)->bits[2];     /* loading the current instruction location */
    matrix_index = (num0+0)->bits[3];    /* loading the current address of CCM */
    priority_index = (temp0+0)->bits[8];
    look_ahead = stack_index + 1;
    /* checking for any priority situation */
    if (priority_flag == 1)
    {
        printf(" the priority flag is one and entering case fnc\n");
        switch ((temp10 + priority_index)->func)
        {
        case 4:
            if ((temp1+matrix_index)->smatrix.bits_row1[(temp0+0)->bits[10]] == 0)
            {
                /* in here it will be determined whether div */
                /* can be added to pipe with add or sub */
                printf(" div is possible and checking to see if additional functions are possible and main case is 4 \n");
                switch ((temp2+stack_index)->func)
                {
                case 1:
                    if ((temp1+matrix_index)->smatrix.bits_row3[(temp0+0)->bits[10]] == 0)
                    {
                        divisional_entry = 1;
                        printf(" divisional entry is 1\n");
                        printf(" addition is also possible\n");
                    }
                    else
                    {
                        divisional_entry = 0;
                        printf(" though addition is the next instruction no latency is available \n");
                        printf(" divisional entry is 0\n");
                    }
                    break;
                case 2:
                    if ((temp1+matrix_index)->smatrix.bits_row3[(temp0+0)->bits[10]] == 0)
                    {
                        divisional_entry = 2;
                        printf(" divisional entry is 2\n");
                        printf(" subtraction is also possible\n");
                    }
                    else
                    {
                        divisional_entry = 0;
                        printf(" though subtraction is the next instruction no latency is available \n");
                        printf(" divisional entry is 0\n");
                    }
                    break;
                default:
                    divisional_entry = 0;
                    printf(" only division is possible \n");
                    printf(" divisional entry is 0\n");
                    break;
                }
            }
            else
            {
                printf("no latency available to process p.s\n");
                (temp0+0)->bits[10] += 1;
                printf("the next latency is %d\n", (temp0+0)->bits[10]);
                printf(" initialising the input registers to 0\n");
                init_key = 1;
                for (i = 0; i <= 8; i++)
                {
                    (temp0+1)->bits[i] = 0;
                    (temp0+2)->bits[i] = 0;
                    (temp0+3)->bits[i] = 0;
                    (temp0+4)->bits[i] = 0;
                }
            }
            break;
        case 5:
            printf(" the case is 5 and delta is being loaded wherein the priority flag is 1 \n");
            for (j = 0; j <= 7; j++)
            {
                (temp0+4)->bits[j] = (temp10+priority_index)->num_two[j];
            }
            /* this part will initiate the tracking registers */
            dis = (temp10 + priority_index)->address;
            (temp12 + dis)->st_track[1] = 1;
            /* re initialising the priority index */
            (temp0+0)->bits[8] += 1;
            divisional_entry = 4;
            delta_flag = 1;
            /* checking and initialising the priority flag */
            printf(" the priority flag is init to 0 \n");
            (temp0+0)->bits[1] = 0;
            /* setting of priority index */
            (temp0+0)->bits[0] += 2;
            /* printing of the results of case 5 */
            printf("printing the pipeline register and flag register\n");
            printf("\n");
            printf("\n");
            printf(" the input registers temp0 + 3 ( 7 - 0 )\n");
            printf("\n");
            printf("\n");
            for (j = 0; j <= 7; j++)
            {
                printf(" %d ", (temp0+3)->bits[7-j]);
            }
            printf("\n");
            printf("\n");
            printf(" the input registers temp0 + 4 ( 8 - 0 )\n");
            printf("\n");
            printf("\n");
            for (j = 0; j <= 8; j++)
            {
                printf(" %d ", (temp0+4)->bits[8-j]);
            }
            printf("\n");
            printf("\n");
            printf(" the flag register 8 - 0\n");
            printf("\n");
            printf("\n");
            for (i = 0; i <= 8; i++)
            {
                printf(" %d ", (temp0+0)->bits[8-i]);
            }
            printf("\n");
            printf("\n");
            printf(" the logging registers 9 - 0\n");
            printf("\n");
            printf("\n");
            for (i = 1; i <= 9; i++)
            {
                printf(" %d ", (temp6+5)->logg[9-i]);
            }
            printf("\n");
            printf("\n");
            printf(" the tracking registers \n");
            printf("\n");
            printf("\n");
            for (i = 1; i <= 9; i++)
            {
                if ((temp6+5)->logg[i] == 1)
                {
                    printf(" the tracking register number is %d and the value of the address is %d \n", i, (temp12+i)->address);
                    printf("\n");
                    printf("\n");
                    for (j = 0; j <= 8; j++)
                    {
                        printf(" %d ", (temp12+i)->st_track[8-j]);
                    }
                }
            }
            printf("\n");
            printf("\n");
            break;
        }
        /* this section below assigns the values for case 4 */
        switch (divisional_entry)
        {
        case 0:
            /* the division is being loaded */
            printf(" the latency is available for iteration\n");
            /* loading the arguments into the stage div */
            for (j = 0; j <= 7; j++)
            {
                (temp0+3)->bits[j] = (temp10+priority_index)->num_one[j];
            }
            for (j = 0; j <= 8; j++)
            {
                (temp0+4)->bits[j] = (temp10+priority_index)->num_two[j];
            }
            (temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_latency[(temp0+0)->bits[10]];
            /* this part will initiate the tracking registers */
            dis = (temp10 + priority_index)->address;
            (temp3 + dis)->st_track[1] = 1;
            (temp0+0)->bits[10] = 0;
            (temp0+0)->bits[8] += 1;
            printf(" the division status is printed below in case 0 in iteration \n");
            status_print4(temp0,temp3,temp6);
            division = 1;
            break;
        case 1:
            /* the addition is being loaded */
            printf(" the latency is available for iteration\n");
            /* loading the arguments into the stage add */
            for (j = 0; j <= 7; j++)
            {
                (temp0+1)->bits[j] = (temp2+stack_index)->num_one[j+1];
                (temp0+2)->bits[j] = (temp2+stack_index)->num_two[j+1];
            }
            (temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_add[(temp0+0)->bits[10]];
            (temp0+0)->bits[2] += 1;
            addition = 1;
            /* this part will initiate the tracking registers */
            for (i = 1; i <= 9; i++)
            {
                if ((temp6+1)->logg[i] == 0)
                {
                    (temp6+1)->logg[i] = 1;
                    (temp5+i)->st_track[1] = 1;
                    (temp5+i)->address = (temp2+look_ahead)->location;
                    i = 10;
                }
            }
            printf(" the addition status is printed below in case 1 in iteration\n");
            status_print1(temp0,temp5,temp6);
            /* the division is being loaded */
            printf(" the latency is available in iteration \n");
            /* loading the arguments into the stage div */
            for (j = 0; j <= 7; j++)
            {
                (temp0+3)->bits[j] = (temp10+priority_index)->num_one[j];
            }
            for (j = 0; j <= 8; j++)
            {
                (temp0+4)->bits[j] = (temp10+priority_index)->num_two[j];
            }
            (temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_add[(temp0+0)->bits[10]];
            /* this part will initiate the tracking registers */
            dis = (temp10 + priority_index)->address;
            (temp3 + dis)->st_track[1] = 1;
            (temp0+0)->bits[10] = 0;
            (temp0+0)->bits[8] += 1;
            printf(" the division status is printed below in case 1 in iteration\n");
            status_print4(temp0,temp3,temp6);
            division = 1;
            break;
        case 2:
            /* the subtraction is being loaded */
            printf(" the latency is available \n");
            /* loading the arguments into the stage sub */
            for (j = 0; j <= 7; j++)
            {
                (temp0+1)->bits[j] = (temp2+stack_index)->num_one[j+1];
                (temp0+2)->bits[j] = (temp2+stack_index)->num_two[j+1];
            }
            (temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_add[(temp0+0)->bits[10]];
            (temp0+0)->bits[2] += 1;
            subtraction = 1;
            /* this part will initiate the tracking registers */
            for (i = 1; i <= 9; i++)
            {
                if ((temp6+2)->logg[i] == 0)
                {
                    (temp6+2)->logg[i] = 1;
                    (temp11+i)->st_track[1] = 1;
                    (temp11+i)->address = (temp2+look_ahead)->location;
                    i = 10;
                }
            }
            printf(" the subtraction status is printed below in case 2 in iteration\n");
            status_print2(temp0,temp11,temp6);
            /* the division is being loaded */
            printf(" the latency is available.\n");
            /* loading the arguments into the stage div */
            for (j = 0; j <= 7; j++)
            {
                (temp0+3)->bits[j] = (temp10+priority_index)->num_one[j];
            }
            for (j = 0; j <= 8; j++)
            {
                (temp0+4)->bits[j] = (temp10+priority_index)->num_two[j];
            }
            (temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_add[(temp0+0)->bits[10]];
            /* this part will initiate the tracking registers */
            dis = (temp10 + priority_index)->address;
            (temp3 + dis)->st_track[1] = 1;
            (temp0+0)->bits[10] = 0;
            (temp0+0)->bits[8] += 1;
            printf(" the division status is printed below in case 12\n");
            status_print4(temp0,temp3,temp6);
            division = 1;
            break;
        case 4:
            break;
        }
    }
    else
    {
        /* This condition indicates no division is awaiting */
        /* any iterations                                   */
        /* checking for addition or subtraction             */
        /* now checking for the type of loading             */
        /* case 1  -> add only                  */
        /* case 2  -> add and multiplication    */
        /* case 3  -> add and division          */
        /* case 4  -> sub only                  */
        /* case 5  -> sub and multiplication    */
        /* case 6  -> sub and division          */
        /* case 7  -> mult only                 */
        /* case 8  -> mult and addition         */
        /* case 9  -> mult and subtraction      */
        /* case 10 -> div only                  */
        /* case 11 -> div and addition          */
        /* case 12 -> div and subtraction       */
        printf(" priority flag is zero and entering case fnc\n");
        switch ((temp2+stack_index)->func)
        {

        case 1:
            /* in here it will be determined whether add */
            /* can be added to pipe with mult or div */
            if ((temp1+matrix_index)->smatrix.bits_row3[(temp0+0)->bits[4]] == 0)
            {
                printf(" add is possible and checking to see if additional functions are possible and main case is 1 \n");
                switch ((temp2+look_ahead)->func)
                {
                case 3:
                    if ((temp1+matrix_index)->smatrix.bits_row2[(temp0+0)->bits[4]] == 0)
                    {
                        additional_entry = 2;
                        printf(" additional entry is 2\n");
                        printf(" multiplication is also possible\n");
                    }
                    else
                    {
                        additional_entry = 1;
                        printf(" additional entry is 1\n");
                        printf(" though multiplication is the next instruction no latency is available \n");
                    }
                    break;
                case 4:
                    if ((temp1+matrix_index)->smatrix.bits_row1[(temp0+0)->bits[4]] == 0)
                    {
                        additional_entry = 3;
                        printf(" additional entry is 3\n");
                        printf(" division is also possible\n");
                    }
                    else
                    {
                        additional_entry = 1;
                        printf(" additional entry is 1\n");
                        printf(" though division is the next instruction no latency is available \n");
                    }
                    break;
                default:
                    additional_entry = 1;
                    printf(" only addition is possible \n");
                    break;
                }
            }
            else
            {
                printf(" no latency is available for addition\n");
                (temp0+0)->bits[4] += 1;
                (temp0+0)->bits[10] += 1;
                printf(" the next latency to look for is %d \n", (temp0+0)->bits[4]);
                additional_entry = 0;
                if ((temp0+0)->bits[4] == 8)
                {
                    printf(" no latency is available and reinitialising to matrix 3 \n");
                    (temp0+0)->bits[4] = 0;
                    (temp0+0)->bits[3] = 3;
                }
            }
            break;
        case 2:
            /* in here it will be determined whether sub */
            /* can be added to pipe with mult or div */
            if ((temp1+matrix_index)->smatrix.bits_row3[(temp0+0)->bits[5]] == 0)
            {
                printf(" sub is possible and checking to see if additional functions are possible and main case is 2 \n");
                switch ((temp2+look_ahead)->func)
                {
                case 3:
                    if ((temp1+matrix_index)->smatrix.bits_row2[(temp0+0)->bits[5]] == 0)
                    {
                        additional_entry = 5;
                        printf(" additional entry is 5\n");
                        printf(" multiplication is also possible\n");
                    }
                    else
                    {
                        additional_entry = 4;
                        printf(" additional entry is 4\n");
                        printf(" though multiplication is the next instruction no latency is available \n");
                    }
                    break;
                case 4:
                    if ((temp1+matrix_index)->smatrix.bits_row1[(temp0+0)->bits[5]] == 0)
                    {
                        additional_entry = 6;
                        printf(" additional entry is 6\n");
                        printf(" division is also possible\n");
                    }
                    else
                    {
                        additional_entry = 4;
                        printf(" additional entry is 4\n");
                        printf(" though division is the next instruction no latency is available \n");
                    }
                    break;
                default:
                    additional_entry = 4;
                    printf(" only subtraction is possible \n");
                    printf(" additional entry is 4\n");
                    break;
                }
            }
            else
            {
                printf(" no latency is available for subtraction\n");
                (temp0+0)->bits[5] += 1;
                (temp0+0)->bits[10] += 1;
                printf(" the next latency to look for is %d \n", (temp0+0)->bits[5]);
                additional_entry = 0;
                if ((temp0+0)->bits[5] == 8)
                {
                    printf(" no latency is available and reinitialising to matrix 3 \n");
                    (temp0+0)->bits[5] = 0;
                    (temp0+0)->bits[3] = 3;
                }
            }
            break;
        case 3:
            /* in here it will be determined whether mult */
            /* can be added to pipe with add or sub */
            if ((temp1+matrix_index)->smatrix.bits_row2[(temp0+0)->bits[6]] == 0)
            {
                printf(" mult is possible and checking to see if additional functions are possible and main case is 3 \n");
                switch ((temp2+look_ahead)->func)
                {
                case 1:
                    if ((temp1+matrix_index)->smatrix.bits_row3[(temp0+0)->bits[6]] == 0)
                    {
                        additional_entry = 8;
                        printf(" additional entry is 8\n");
                        printf(" addition is also possible\n");
                    }
                    else
                    {
                        additional_entry = 7;
                        printf(" though addition is the next instruction no latency is available \n");
                        printf(" additional entry is 7\n");
                    }
                    break;
                case 2:
                    if ((temp1+matrix_index)->smatrix.bits_row3[(temp0+0)->bits[6]] == 0)
                    {
                        additional_entry = 9;
                        printf(" additional entry is 9\n");
                        printf(" subtraction is also possible\n");
                    }
                    else
                    {
                        additional_entry = 7;
                        printf(" though subtraction is the next instruction no latency is available \n");
                        printf(" additional entry is 7\n");
                    }
                    break;
                default:
                    additional_entry = 7;
                    printf(" only multiplication is possible \n");
                    printf(" additional entry is 7\n");
                    break;
                }
            }
            else
            {
                printf(" no latency is available for multiplication\n");
                (temp0+0)->bits[6] += 1;
                (temp0+0)->bits[10] += 1;
                printf(" the next latency to look for is %d \n", (temp0+0)->bits[6]);
                additional_entry = 0;
                if ((temp0+0)->bits[6] == 8)
                {
                    printf(" no latency is available and reinitialising to matrix 2 \n");
                    (temp0+0)->bits[6] = 0;
                    (temp0+0)->bits[3] = 2;
                }
            }
            break;
        case 4:
            /* in here it will be determined whether div */
            /* can be added to pipe with add or sub */
            if ((temp1+matrix_index)->smatrix.bits_row1[(temp0+0)->bits[7]] == 0)
            {
                printf(" div is possible and checking to see if additional functions are possible and main case is 4 \n");
                switch ((temp2+(look_ahead+1))->func)
                {
                case 1:
                    if ((temp1+matrix_index)->smatrix.bits_row3[(temp0+0)->bits[7]] == 0)
                    {
                        additional_entry = 11;
                        printf(" additional entry is 11\n");
                        printf(" addition is also possible\n");
                    }
                    else
                    {
                        additional_entry = 10;
                        printf(" though addition is the next instruction no latency is available \n");
                        printf(" additional entry is 10\n");
                    }
                    break;
                case 2:
                    if ((temp1+matrix_index)->smatrix.bits_row3[(temp0+0)->bits[7]] == 0)
                    {
                        additional_entry = 12;
                        printf(" additional entry is 12\n");
                        printf(" subtraction is also possible\n");
                    }
                    else
                    {
                        additional_entry = 10;
                        printf(" though subtraction is the next instruction no latency is available \n");
                        printf(" additional entry is 10\n");
                    }
                    break;
                default:
                    additional_entry = 10;
                    printf(" only division is possible \n");
                    printf(" additional entry is 10\n");
                    break;
                }
            }
            else
            {
                printf(" no latency is available for division\n");
                (temp0+0)->bits[7] += 1;
                (temp0+0)->bits[10] += 1;
                printf(" the next latency to look for is %d \n", (temp0+0)->bits[7]);
                additional_entry = 0;
                if ((temp0+0)->bits[7] == 8)
                {
                    printf(" no latency is available and reinitialising to matrix 1 \n");
                    (temp0+0)->bits[7] = 0;
                    (temp0+0)->bits[3] = 1;
                }
            }
            break;
        case 5:
            printf(" the case is 5 and delta is being loaded wherein the priority index is 0\n");
            for (j = 0; j <= 7; j++)
            {
                (temp0+4)->bits[j] = (temp2+stack_index)->num_two[j];
            }
            if (re_adjust == 1)
            {
                printf("\n");
                printf(" the re-adjust is recognised as 1 and re-adjust is assigned 0 and the stack index is doubly incremented\n");
                printf("\n");
                re_adjust = 0;
            }
            else
            {
                (temp0+0)->bits[2] += 1;
                printf("\n");
                printf(" the re-adjust is recognised as 0 and the instruction stack flag is singly incremented\n");
                printf("\n");
            }
            delta_flag = 1;
            additional_entry = 0;
            /* this part will initiate the tracking registers */
            for (i = 1; i <= 9; i++)
            {
                if ((temp6+5)->logg[i] == 0)
                {
                    (temp6+5)->logg[i] = 1;
                    (temp12+i)->st_track[1] = 1;
                    (temp12+i)->address = (temp2+stack_index)->location;
                    i = 10;
                }
            }
            /* printing of the results of case 5 */
            printf("printing the pipeline register and flag register\n");
            printf("\n");
            printf("\n");
            printf(" the input registers temp0 + 3 ( 7 - 0 )\n");
            printf("\n");
            printf("\n");
            for (j = 0; j <= 7; j++)
            {
                printf(" %d ", (temp0+3)->bits[7-j]);
            }
            printf("\n");
            printf("\n");
            printf(" the input registers temp0 + 4 ( 8 - 0 )\n");
            printf("\n");
            printf("\n");
            for (j = 0; j <= 8; j++)
            {
                printf(" %d ", (temp0+4)->bits[8-j]);
            }
            printf("\n");
            printf("\n");
            printf(" the flag register 8 - 0\n");
            printf("\n");
            printf("\n");
            for (i = 0; i <= 8; i++)
            {
                printf(" %d ", (temp0+0)->bits[8-i]);
            }
            printf("\n");
            printf("\n");
            printf(" the logging registers 9 - 0\n");
            printf("\n");
            printf("\n");
            for (i = 1; i <= 9; i++)
            {
                printf(" %d ", (temp6+5)->logg[9-i]);
            }
            printf("\n");
            printf("\n");
            printf(" the tracking registers \n");
            printf("\n");
            printf("\n");
            for (i = 1; i <= 9; i++)
            {
                if ((temp6+5)->logg[i] == 1)
                {
                    printf(" the tracking register number is %d and the value of the address is %d \n", i, (temp12+i)->address);
                    printf("\n");
                    printf("\n");
                    for (j = 0; j <= 8; j++)
                    {
                        printf(" %d ", (temp12+i)->st_track[8-j]);
                    }
                }
            }
            printf("\n");
            printf("\n");
            break;
        }
    }
    switch (additional_entry)
    {
    case 1:
        /* the addition is being loaded */
        printf(" the additional entry is 1 and addition only \n");
        /* loading the arguments into the stage add */
        for (j = 0; j <= 7; j++)
        {
            (temp0+1)->bits[j] = (temp2+stack_index)->num_one[j+1];
            (temp0+2)->bits[j] = (temp2+stack_index)->num_two[j+1];
        }
        (temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.add_latency[(temp0+0)->bits[4]];
        (temp0+0)->bits[4] = 0;
        (temp0+0)->bits[10] = 0;
        (temp0+0)->bits[2] += 1;
        addition = 1;
        /* this part will initiate the tracking registers */
        for (i = 1; i <= 9; i++)
        {
            if ((temp6+1)->logg[i] == 0)
            {
                (temp6+1)->logg[i] = 1;
                (temp5+i)->st_track[1] = 1;
                (temp5+i)->address = (temp2+stack_index)->location;
                i = 10;
            }
        }
        printf(" the addition status is printed below in case 1\n");
        status_print1(temp0,temp5,temp6);
        break;
    case 2:
        /* the addition is being loaded */
        printf(" the latency is available \n");
        /* loading the arguments into the stage add */
        for (j = 0; j <= 7; j++)
        {
            (temp0+1)->bits[j] = (temp2+stack_index)->num_one[j+1];
            (temp0+2)->bits[j] = (temp2+stack_index)->num_two[j+1];
        }
        (temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.mult_add[(temp0+0)->bits[4]];
        (temp0+0)->bits[2] += 1;
        (temp0+0)->bits[10] = 0;
        addition = 1;
        /* this part will initiate the tracking registers */
        for (i = 1; i <= 9; i++)
        {
            if ((temp6+1)->logg[i] == 0)
            {
                (temp6+1)->logg[i] = 1;
                (temp5+i)->st_track[1] = 1;
                (temp5+i)->address = (temp2+stack_index)->location;
                i = 10;
            }
        }
        printf(" the addition status is printed below in case 2\n");
        status_print1(temp0,temp5,temp6);
        /* the multiplication is being loaded */
        printf(" the latency is available \n");
        /* loading the arguments into the stage mult */
        for (j = 0; j <= 7; j++)
        {
            (temp0+3)->bits[j] = (temp2+look_ahead)->num_one[j+1];
            (temp0+4)->bits[j] = (temp2+look_ahead)->num_two[j+1];
        }
        (temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.mult_add[(temp0+0)->bits[4]];
        (temp0+0)->bits[4] = 0;
        (temp0+0)->bits[2] += 1;
        /* this part will initiate the tracking registers */
        for (i = 1; i <= 9; i++)
        {
            if ((temp6+3)->logg[i] == 0)
            {
                (temp6+3)->logg[i] = 1;
                (temp4+i)->st_track[1] = 1;
                (temp4+i)->address = (temp2+look_ahead)->location;
                i = 10;
            }
        }
        printf(" the multiplication status is printed below in case 2\n");
        status_print3(temp0,temp4,temp6);
        multiplication = 1;
        break;
	case 3:
		/* the addition is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage add */
		for(j=0;j<=7;j++)
		{
			(temp0+1)->bits[j] = (temp2+stack_index)->num_one[j+1];
			(temp0+2)->bits[j] = (temp2+stack_index)->num_two[j+1];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_add[(temp0+0)->bits[4]];
		(temp0+0)->bits[2] += 1;
		(temp0+0)->bits[10] = 0;
		addition = 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+1)->logg[i] == 0)
			{
				(temp6+1)->logg[i] = 1;
				(temp5+i)->st_track[1] = 1;
				(temp5+i)->address = (temp2+stack_index)->location;
				i = 10;
			}
		}
		printf(" the addition status is printed below in case 3\n");
		status_print1(temp0,temp5,temp6);
		/* the division is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage div */
		for(j=0;j<=7;j++)
		{
			(temp0+3)->bits[j] = (temp2+look_ahead)->num_one[j+1];
		}
		for(j=0;j<=8;j++)
		{
			(temp0+4)->bits[j] = (temp2+look_ahead)->num_two[j];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_add[(temp0+0)->bits[4]];
		(temp0+0)->bits[4] = 0;
		(temp0+0)->bits[2] += 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+4)->logg[i] == 0)
			{
				(temp6+4)->logg[i] = 1;
				(temp3+i)->st_track[1] = 1;
				(temp3+i)->address = (temp2+look_ahead)->location;
				i = 10;
			}
		}
		printf(" the division status is printed below in case 3\n");
		status_print4(temp0,temp3,temp6);
		division = 1;
		break;
	case 4:
		/* the subtraction is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage add */
		for(j=0;j<=7;j++)
		{
			(temp0+1)->bits[j] = (temp2+stack_index)->num_one[j+1];
			(temp0+2)->bits[j] = (temp2+stack_index)->num_two[j+1];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.add_latency[(temp0+0)->bits[5]];
		(temp0+0)->bits[5] = 0;
		(temp0+0)->bits[10] = 0;
		(temp0+0)->bits[2] += 1;
		subtraction = 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+2)->logg[i] == 0)
			{
				(temp6+2)->logg[i] = 1;
				(temp11+i)->st_track[1] = 1;
				(temp11+i)->address = (temp2+stack_index)->location;
				i = 10;
			}
		}
		printf(" the subtraction status is printed below in case 4\n");
		status_print2(temp0,temp11,temp6);
		break;
	case 5:
		/* the subtraction is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage sub */
		for(j=0;j<=7;j++)
		{
			(temp0+1)->bits[j] = (temp2+stack_index)->num_one[j+1];
			(temp0+2)->bits[j] = (temp2+stack_index)->num_two[j+1];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.mult_add[(temp0+0)->bits[5]];
		(temp0+0)->bits[2] += 1;
		(temp0+0)->bits[10] = 0;
		subtraction = 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+2)->logg[i] == 0)
			{
				(temp6+2)->logg[i] = 1;
				(temp11+i)->st_track[1] = 1;
				(temp11+i)->address = (temp2+stack_index)->location;
				i = 10;
			}
		}
		printf(" the subtraction status is printed below in case 5\n");
		status_print2(temp0,temp11,temp6);
		/* the multiplication is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage mult */
		for(j=0;j<=7;j++)
		{
			(temp0+3)->bits[j] = (temp2+look_ahead)->num_one[j+1];
			(temp0+4)->bits[j] = (temp2+look_ahead)->num_two[j+1];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.mult_add[(temp0+0)->bits[5]];
		(temp0+0)->bits[5] = 0;
		(temp0+0)->bits[2] += 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+3)->logg[i] == 0)
			{
				(temp6+3)->logg[i] = 1;
				(temp4+i)->st_track[1] = 1;
				(temp4+i)->address = (temp2+look_ahead)->location;
				i = 10;
			}
		}
		printf(" the multiplication status is printed below in case 5\n");
		status_print3(temp0,temp4,temp6);
		multiplication = 1;
		break;
	case 6:
		/* the subtraction is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage sub */
		for(j=0;j<=7;j++)
		{
			(temp0+1)->bits[j] = (temp2+stack_index)->num_one[j+1];
			(temp0+2)->bits[j] = (temp2+stack_index)->num_two[j+1];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_add[(temp0+0)->bits[5]];
		(temp0+0)->bits[2] += 1;
		(temp0+0)->bits[10] = 0;
		subtraction = 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+2)->logg[i] == 0)
			{
				(temp6+2)->logg[i] = 1;
				(temp11+i)->st_track[1] = 1;
				(temp11+i)->address = (temp2+stack_index)->location;
				i = 10;
			}
		}
		printf(" the subtraction status is printed below in case 6\n");
		status_print2(temp0,temp11,temp6);
		/* the division is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage div */
		for(j=0;j<=7;j++)
		{
			(temp0+3)->bits[j] = (temp2+look_ahead)->num_one[j+1];
		}
		for(j=0;j<=8;j++)
		{
			(temp0+4)->bits[j] = (temp2+look_ahead)->num_two[j];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_add[(temp0+0)->bits[5]];
		(temp0+0)->bits[5] = 0;
		(temp0+0)->bits[2] += 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+4)->logg[i] == 0)
			{
				(temp6+4)->logg[i] = 1;
				(temp3+i)->st_track[1] = 1;
				(temp3+i)->address = (temp2+look_ahead)->location;
				i = 10;
			}
		}
		printf(" the division status is printed below in case 6\n");
		status_print4(temp0,temp3,temp6);
		division = 1;
		break;
	case 7:
		/* the multiplication is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage mult */
		for(j=0;j<=7;j++)
		{
			(temp0+3)->bits[j] = (temp2+stack_index)->num_one[j+1];
			(temp0+4)->bits[j] = (temp2+stack_index)->num_two[j+1];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.mult_latency[(temp0+0)->bits[6]];
		(temp0+0)->bits[6] = 0;
		(temp0+0)->bits[10] = 0;
		(temp0+0)->bits[2] += 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+3)->logg[i] == 0)
			{
				(temp6+3)->logg[i] = 1;
				(temp4+i)->st_track[1] = 1;
				(temp4+i)->address = (temp2+stack_index)->location;
				i = 10;
			}
		}
		printf(" the multiplication status is printed below in case 7\n");
		status_print3(temp0,temp4,temp6);
		multiplication = 1;
		break;
	case 8:
		/* the addition is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage add */
		for(j=0;j<=7;j++)
		{
			(temp0+1)->bits[j] = (temp2+look_ahead)->num_one[j+1];
			(temp0+2)->bits[j] = (temp2+look_ahead)->num_two[j+1];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.mult_add[(temp0+0)->bits[6]];
		(temp0+0)->bits[2] += 1;
		(temp0+0)->bits[10] = 0;
		addition = 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+1)->logg[i] == 0)
			{
				(temp6+1)->logg[i] = 1;
				(temp5+i)->st_track[1] = 1;
				(temp5+i)->address = (temp2+look_ahead)->location;
				i = 10;
			}
		}
		printf(" the addition status is printed below in case 8\n");
		status_print1(temp0,temp5,temp6);
		/* the multiplication is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage mult */
		for(j=0;j<=7;j++)
		{
			(temp0+3)->bits[j] = (temp2+stack_index)->num_one[j+1];
			(temp0+4)->bits[j] = (temp2+stack_index)->num_two[j+1];
		}
		printf("\n");
		printf("\n");
		printf(" looking at temp0 + 3 in case 8 \n");
		printf("\n");
		printf("\n");
		for(j=0;j<=7;j++)
		{
			printf(" %d ",(temp0+3)->bits[j]);
		}
		printf("\n");
		printf("\n");
		printf(" looking at temp0 + 4 in case 8 \n");
		printf("\n");
		printf("\n");
		for(j=0;j<=7;j++)
		{
			printf(" %d ",(temp0+4)->bits[j]);
		}
		printf("\n");
		printf("\n");
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.mult_add[(temp0+0)->bits[6]];
		(temp0+0)->bits[6] = 0;
		(temp0+0)->bits[2] += 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+3)->logg[i] == 0)
			{
				(temp6+3)->logg[i] = 1;
				(temp4+i)->st_track[1] = 1;
				(temp4+i)->address = (temp2+stack_index)->location;
				i = 10;
			}
		}
		printf(" the multiplication status is printed below in case 8\n");
		status_print3(temp0,temp4,temp6);
		multiplication = 1;
		break;
	case 9:
		/* the subtraction is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage sub */
		for(j=0;j<=7;j++)
		{
			(temp0+1)->bits[j] = (temp2+look_ahead)->num_one[j+1];
			(temp0+2)->bits[j] = (temp2+look_ahead)->num_two[j+1];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.mult_add[(temp0+0)->bits[6]];
		(temp0+0)->bits[2] += 1;
		(temp0+0)->bits[10] = 0;
		subtraction = 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+2)->logg[i] == 0)
			{
				(temp6+2)->logg[i] = 1;
				(temp11+i)->st_track[1] = 1;
				(temp11+i)->address = (temp2+look_ahead)->location;
				i = 10;
			}
		}
		printf(" the subtraction status is printed below in case 9\n");
		status_print2(temp0,temp11,temp6);
		/* the multiplication is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage mult */
		for(j=0;j<=7;j++)
		{
			(temp0+3)->bits[j] = (temp2+stack_index)->num_one[j+1];
			(temp0+4)->bits[j] = (temp2+stack_index)->num_two[j+1];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.mult_add[(temp0+0)->bits[6]];
		(temp0+0)->bits[6] = 0;
		(temp0+0)->bits[2] += 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+3)->logg[i] == 0)
			{
				(temp6+3)->logg[i] = 1;
				(temp4+i)->st_track[1] = 1;
				(temp4+i)->address = (temp2+stack_index)->location;
				i = 10;
			}
		}
		printf(" the multiplication status is printed below in case 9\n");
		status_print3(temp0,temp4,temp6);
		multiplication = 1;
		break;
	case 10:
		/* the division is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage div */
		for(j=0;j<=7;j++)
		{
			(temp0+3)->bits[j] = (temp2+stack_index)->num_one[j+1];
		}
		for(j=0;j<=8;j++)
		{
			(temp0+4)->bits[j] = (temp2+stack_index)->num_two[j];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_latency[(temp0+0)->bits[7]];
		(temp0+0)->bits[7] = 0;
		(temp0+0)->bits[10] = 0;
		(temp0+0)->bits[2] += 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+4)->logg[i] == 0)
			{
				(temp6+4)->logg[i] = 1;
				(temp3+i)->st_track[1] = 1;
				(temp3+i)->address = (temp2+stack_index)->location;
				i = 10;
			}
		}
		printf(" the division status is printed below in case 10\n");
		status_print4(temp0,temp3,temp6);
		division = 1;
		break;
	case 11:
		/* the addition is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage add */
		for(j=0;j<=7;j++)
		{
			(temp0+1)->bits[j] = (temp2+(look_ahead+1))->num_one[j+1];
			(temp0+2)->bits[j] = (temp2+(look_ahead+1))->num_two[j+1];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_add[(temp0+0)->bits[7]];
		re_adjust = 1;
		if(re_adjust == 1)
		{
			printf("\n");
			printf(" re_adjust has been assigned one in case 11\n");
			printf("\n");
		}
		addition = 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+1)->logg[i] == 0)
			{
				(temp6+1)->logg[i] = 1;
				(temp5+i)->st_track[1] = 1;
				(temp5+i)->address = (temp2+(look_ahead+1))->location;
				i = 10;
			}
		}
		printf(" the addition status is printed below in case 11\n");
		status_print1(temp0,temp5,temp6);
		/* the division is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage div */
		for(j=0;j<=7;j++)
		{
			(temp0+3)->bits[j] = (temp2+stack_index)->num_one[j+1];
		}
		for(j=0;j<=8;j++)
		{
			(temp0+4)->bits[j] = (temp2+stack_index)->num_two[j];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_add[(temp0+0)->bits[7]];
		(temp0+0)->bits[7] = 0;
		(temp0+0)->bits[10] = 0;
		(temp0+0)->bits[2] += 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+4)->logg[i] == 0)
			{
				(temp6+4)->logg[i] = 1;
				(temp3+i)->st_track[1] = 1;
				(temp3+i)->address = (temp2+stack_index)->location;
				i = 10;
			}
		}
		printf(" the division status is printed below in case 11\n");
		status_print4(temp0,temp3,temp6);
		division = 1;
		break;
	case 12:
		/* the subtraction is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage sub */
		for(j=0;j<=7;j++)
		{
			(temp0+1)->bits[j] = (temp2+(look_ahead+1))->num_one[j+1];
			(temp0+2)->bits[j] = (temp2+(look_ahead+1))->num_two[j+1];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_add[(temp0+0)->bits[7]];
		re_adjust = 1;
		if(re_adjust == 1)
		{
			printf("\n");
			printf(" re_adjust has been assigned one in case 12\n");
			printf("\n");
		}
		subtraction = 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+2)->logg[i] == 0)
			{
				(temp6+2)->logg[i] = 1;
				(temp11+i)->st_track[1] = 1;
				(temp11+i)->address = (temp2+(look_ahead+1))->location;
				i = 10;
			}
		}
		printf(" the subtraction status is printed below in case 12\n");
		status_print2(temp0,temp11,temp6);
		/* the division is being loaded */
		printf(" the latency is available \n");
		/* loading the arguments into the stage div */
		for(j=0;j<=7;j++)
		{
			(temp0+3)->bits[j] = (temp2+stack_index)->num_one[j+1];
		}
		for(j=0;j<=8;j++)
		{
			(temp0+4)->bits[j] = (temp2+stack_index)->num_two[j];
		}
		(temp0+0)->bits[3] = (temp1+matrix_index)->sdirection.div_add[(temp0+0)->bits[7]];
		(temp0+0)->bits[7] = 0;
		(temp0+0)->bits[10] = 0;
		(temp0+0)->bits[2] += 1;
		/* this part will initiate the tracking registers */
		for(i=1;i<=9;i++)
		{
			if((temp6+4)->logg[i] == 0)
			{
				(temp6+4)->logg[i] = 1;
				(temp3+i)->st_track[1] = 1;
				(temp3+i)->address = (temp2+stack_index)->location;
				i = 10;
			}
		}
		printf(" the division status is printed below in case 12\n");
		status_print4(temp0,temp3,temp6);
		division = 1;
		break;
	case 0:
		if(delta_flag == 0)
		{
			printf(" the additional entry = 0 and delta flag = 0\n");
			/* init the input reg to the pipe to 0 */
			init_key = 1;
			for(j=0;j<=8;j++)
			{
				(temp0+1)->bits[j] = 0;
				(temp0+2)->bits[j] = 0;
				(temp0+3)->bits[j] = 0;
				(temp0+4)->bits[j] = 0;
			}
		}
		break;
	}
	}
	return (*num0);
}

/********************************/
/****** Function Set Logg  ******/
/********************************/
struct6 set_logg(num1)
struct6 *num1;
{
	int i,j,k,l;
	struct6 *temp1;
	temp1 = num1;
	/* this function checks to see */
	/* whether any of the process loggs */
	/* are empty */
	for(i=1;i<=5;i++)
	{
		k = 0;
		for(j=0;j<=9;j++)
		{
			k = k | (temp1+i)->logg[j];
		}
		if(k == 0)
		{
			(temp1+i)->logg_stat = 0;
			printf(" logg stat is made 0 \n");
		}
		else
		{
			(temp1+i)->logg_stat = 1;
			printf(" logg stat is made 1 \n");
		}
	}
	return (*num1);
}

main()
{
	int one,two,i,j,k,l,m,v;
	FILE *inptr;
	FILE *read_ptr;
	int index,test,x,enough;
	index = 0;
	test = 0;
	x = 0;
	enough = 0;
	memory_ptr = &memory;
	dstack1_ptr = &decode_stack1;
	dstack2_ptr = &decode_stack2;
	ilatch_ptr = &iunit_latches;
	inhold_ptr = &internal_holders;
	statusu_ptr = &status_unit;
	gp_ptr = &gp_register;
	sr_ptr = &register_sr;
	pgm_ptr1 = &pgm_counter1;
	pgm_ptr2 = &pgm_counter2;
	isunit_ptr = &isunit_latch;
	sstatus_ptr = &stream_status;
	picstatus_ptr = &picqueue_status;
	eacstatus_ptr = &eacqueue_status;
	(sstatus_ptr)->picqueue_full = 0;
	(sstatus_ptr)->eacqueue_full = 0;
	/* enter the instruction */
	par_pointer = &par_product;
	arg1_pointer = &argument1;
	arg2_pointer = &argument2;
	lat_pointer = &latches;
	trans_pointer = &transfer;
	delay_ptr = &delay;
	mpreg_ptr = &multipurpose_reg;
	divflow_ptr = &div_follow;
	deltaflow_ptr = &delta_track;
	multflow_ptr = &mult_follow;
	addflow_ptr = &add_follow;
	subflow_ptr = &sub_follow;
	prlogg_ptr = &process_logg;
	prstack_ptr = &priority_stack;
	copy_one = arg1_pointer;
	copy_two = arg2_pointer;
	copy_three = par_pointer;
	copy_four = lat_pointer;
	copy_five = trans_pointer;
	copy_six = delay_ptr;
	/* initialising the pointers to the variables */
	for(i=1;i<=20;i++)
	{
		ptr_argmnt1[i] = &arg_one[i][0];
		ptr_argmnt2[i] = &arg_two[i][0];
	}
	ptr_op = &op_code;
	instack_ptr = &input_stack;
	outstack_ptr = &output_stack;
	bin_pointer = &binary_matrix;
	/* initialising all the flags to zero */
	printf("initialising the flags \n");
	for(i=0;i<=8;i++)
	{
		(mpreg_ptr+0)->bits[i] = 0;
	}
	(mpreg_ptr+0)->bits[3] = 3;
	(mpreg_ptr+0)->bits[2] = 1;
	(mpreg_ptr+0)->bits[0] = 2;
	re_adjust = 0;
	/* reading in of the instruction stack and control structures */
	/* reading of control.dat */
	printf(" enter the number of instructions in the stack \n");
	scanf("%d",&stk_ptr);
	inptr = fopen("control.dat", "r");
	if(inptr == (FILE *)NULL)
	{
		printf(" error in reading operation \n");
		exit(1);
	}
	fread(bin_pointer, sizeof(struct collision_matrix), 89, inptr);
	fclose(inptr);
	/* reading of the instr.dat */
	/* reading of the instruction stack */
	read_ptr = fopen("instr.dat", "r");
	if(read_ptr == (FILE *)NULL)
	{
		printf(" error in reading operation for instr.dat\n");
		exit(1);
	}
	for(i=1;i<=stk_ptr;i++)
	{
		fscanf(read_ptr,"\n");
		fscanf(read_ptr," %d\t ",&op_code[i]);
		for(j=1;j<=8;j++)
		{
			fscanf(read_ptr," %d ",&arg_one[i][j]);
		}
		fscanf(read_ptr,"\t");
		for(j=1;j<=8;j++)
		{
			fscanf(read_ptr," %d ",&arg_two[i][j]);
		}
		fscanf(read_ptr,"\n");
		fscanf(read_ptr,"\n");
	}
	fclose(read_ptr);
	/* printing of the instruction stack */
	printf(" the instruction stack is printed below \n");
	for(i=1;i<=stk_ptr;i++)
	{
		printf("\n");
		printf(" %d\t ",op_code[i]);
		for(j=1;j<=8;j++)
		{
			printf(" %d ",arg_one[i][j]);
		}
		printf("\t");
		for(j=1;j<=8;j++)
		{
			printf(" %d ",arg_two[i][j]);
		}
		printf("\n");
		printf("\n");
	}
	printf(" the various structures are tabulated below \n");

	printf("\n");
	for(v=1; v<=88; v++)
	{
		for(l=0;l<8;++l)
		{
			printf("%d \b ",binary_matrix[v].smatrix.bits_row1[l]);
		}
		printf("\n");
		printf("\n");
		for(l=0;l<8;++l)
		{
			printf("%d \b ",binary_matrix[v].smatrix.bits_row2[l]);
		}
		printf("\n");
		printf("\n");
		for(l=0;l<8;++l)
		{
			printf("%d \b ",binary_matrix[v].smatrix.bits_row3[l]);
		}
		printf("\n");
		printf("\n");
		printf("\n");
		for(l=0;l<8;++l)
		{
			printf("%d \b ",binary_matrix[v].sdirection.div_latency[l]);
		}
		printf("\n");
		printf("\n");
	}
	for(i=1;i<=stk_ptr;i++)
	{
		printf("the value of argument one is as follows \n");
		for(j=0;j<=8;j++)
		{
			printf(" %d ",(instack_ptr + i)->num_one[j]);
		}
		printf("\n");
		printf("the value of argument two is as follows \n");
		for(j=0;j<=8;j++)
		{
			printf(" %d ",(instack_ptr + i)->num_two[j]);
		}
		printf("\n");
	}
	i = 0;
	while((mpreg_ptr+0)->bits[2] <= 13)
	{
		/* initialising the pointers */
		(pgm_ptr1)->counter[0] = 1;
		(pgm_ptr2)->counter[0] = 0;
		/* the T ON cycle */
		fetch_unit(memory_ptr,inhold_ptr,pgm_ptr1,pgm_ptr2,
			picstatus_ptr,eacstatus_ptr);
		load_pipeline(mpreg_ptr,bin_pointer,instack_ptr,
			divflow_ptr,multflow_ptr,addflow_ptr,prlogg_ptr,
			var1,var2,var3,prstack_ptr,subflow_ptr,deltaflow_ptr);
		pipeline();
		set_logg(prlogg_ptr);
		/* the T OFF cycle */
		time_off();
		output_check(trans_pointer,prstack_ptr,
			outstack_ptr,divflow_ptr,multflow_ptr,addflow_ptr,
			prlogg_ptr,subflow_ptr,deltaflow_ptr,mpreg_ptr);
		shift_track(divflow_ptr,multflow_ptr,addflow_ptr,
			prlogg_ptr,subflow_ptr,deltaflow_ptr);
	}
}
