
e-notes by Prof. H.V. Ravish Aradhya, RVCE, Bangalore

COMPUTER ORGANIZATION (CS 46)


Chapter 5: ARITHMETIC:
The basic operation needed for all arithmetic in a digital computer is the addition of two numbers. Subtraction can be achieved through addition and complement operations, multiplication through repeated addition, and division through repeated subtraction. All arithmetic and logic operations are implemented in the arithmetic and logic unit (ALU) of a processor. The basic building block of the ALU is a parallel adder: all arithmetic operations and basic logic functions such as AND, OR, NOT, and EXCLUSIVE-OR (XOR) are implemented using a parallel adder and additional combinational circuits.

The time needed to perform an addition affects the processor's performance. Multiply and divide operations, which require more complex circuitry, also affect it. It is therefore necessary to use advanced techniques that perform arithmetic and logic operations at high speed. Compared with arithmetic operations, logic operations are simple to implement using combinational circuitry: they require only independent Boolean operations on individual bit positions of the operands, whereas arithmetic operations require carry/borrow signals to pass laterally between bit positions.

As already observed, the 2's-complement form of representing a signed binary number is the best representation for performing addition and subtraction. The examples already used show that two n-bit signed numbers can be added using n-bit binary addition, treating the sign bit the same as the other bits. In other words, a logic circuit designed to add unsigned binary numbers can also be used to add signed numbers in 2's-complement form. If overflow does not occur, the sum is correct and any output carry can be ignored. If overflow occurs, the true sum cannot be represented in n bits, so the n-bit result is not valid and the overflow must be detected and signalled.

ADDITION AND SUBTRACTION OF SIGNED NUMBERS:


In Figure-1, the function table of a full adder is shown; the sum and carry-out are the outputs for adding equally weighted bits xi and yi of two numbers X and Y. The logic expressions for these functions are also shown, along with an example of addition of the 4-bit unsigned numbers 7 and 6. Note that each stage of the addition process must accommodate a carry-in bit. We use ci to represent the carry-in to the ith stage, which is the same as the carry-out from the (i - 1)th stage. The logic expression for si in Figure-1 can be implemented with a 3-input XOR gate.

The carry-out function, ci+1, is implemented with a two-level AND-OR logic circuit. A convenient symbol for the complete circuit for a single stage of addition, called a full adder (FA), is shown in Figure-1a. A cascaded connection of n such full adder blocks, as shown in Figure-1b, forms a parallel adder and can be used to add two n-bit numbers. Since the carries must propagate, or ripple, through this cascade, the configuration is called an n-bit ripple-carry adder. The carry-in, c0, into the least-significant-bit (LSB) position (the first stage) provides a convenient means of adding 1 to a number. For instance, forming the 2's-complement of a number involves adding 1 to the 1's-complement of the number. The carry signals are also useful for interconnecting k adders to form an adder capable of handling input numbers that are kn bits long, as shown in Figure-1c.
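The full adder and the ripple-carry cascade described above can be sketched in a few lines of Python. This is only an illustrative model, not a hardware description: the function names (full_adder, ripple_carry_add, to_bits, from_bits) are chosen for this example, and bit vectors are assumed to be Python lists with the LSB first.

```python
def full_adder(x, y, c):
    """One-bit full adder: returns (sum, carry_out) for bits x, y and carry-in c."""
    s = x ^ y ^ c                          # s_i = x_i XOR y_i XOR c_i (3-input XOR)
    c_out = (x & y) | (x & c) | (y & c)    # c_{i+1}: two-level AND-OR circuit
    return s, c_out


def ripple_carry_add(x_bits, y_bits, c0=0):
    """Add two equal-length bit lists (LSB first) with a cascade of full adders.

    Returns (sum_bits, carries), where carries[i] is c_i and carries[-1] is c_n.
    """
    carries = [c0]
    sum_bits = []
    for x, y in zip(x_bits, y_bits):
        s, c_next = full_adder(x, y, carries[-1])   # carry ripples to the next stage
        sum_bits.append(s)
        carries.append(c_next)
    return sum_bits, carries


def to_bits(value, n):
    """Unsigned integer -> list of n bits, LSB first."""
    return [(value >> i) & 1 for i in range(n)]


def from_bits(bits):
    """List of bits (LSB first) -> unsigned integer."""
    return sum(b << i for i, b in enumerate(bits))


# The 4-bit example from the text: 7 + 6
s, c = ripple_carry_add(to_bits(7, 4), to_bits(6, 4))
print(from_bits(s), c[-1])   # 13 0
```

Setting c0 = 1 in the call adds 1 to the sum, which is the property used later to complete a 2's-complement.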

FIG-1: Addition of binary vectors.

FIG-2: 4-bit parallel adder.

Addition/Subtraction Logic Diagram:


The 4-bit adder shown in Figure 2 can be used to add 2's-complement numbers X and Y, where the $x_{n-1}$ and $y_{n-1}$ bits are the sign bits. In this case, the carry-out bit $c_n$ is not part of the answer. In an addition, overflow can occur only when the signs of the two operands are the same, and it has occurred if the sign of the result then differs from theirs. Therefore, a circuit to detect overflow can be added to the n-bit adder by implementing the logic expression

$\text{Overflow} = x_{n-1}\, y_{n-1}\, \overline{s}_{n-1} + \overline{x}_{n-1}\, \overline{y}_{n-1}\, s_{n-1}$

Overflow can also be detected from the carry bits $c_n$ and $c_{n-1}$: it occurs if $c_n$ and $c_{n-1}$ are different. Therefore, a much simpler alternative circuit for detecting overflow is obtained by implementing the expression $c_n \oplus c_{n-1}$ with an XOR gate.

Subtraction of two numbers X and Y can be performed using the 2's-complement method. To perform X - Y, we form the 2's-complement of Y and add it to X. The logic circuit shown in Figure-3 can be used to perform either addition or subtraction, based on the value applied to the Add/Sub input control line. This line is set to 0 for addition, applying the Y vector unchanged to one of the adder inputs along with a carry-in signal, c0, of 0. When the line is set to 1 for subtraction, the XOR gates invert the Y vector and c0 is 1. Because the control input and its associated XOR gates either invert Y (subtraction) or pass it unchanged (addition), this arrangement is called a control inverter circuit.
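A small behavioural sketch of the adder/subtractor may help here. It assumes n-bit operands passed as unsigned bit patterns, models the XOR gates on the Y input and the carry-in driven by the Add/Sub line, and reports overflow as the XOR of the last two carries. The names add_sub and signed are hypothetical helpers for this illustration only.

```python
def add_sub(x, y, sub, n=4):
    """n-bit adder/subtractor. x and y are n-bit patterns (0 .. 2**n - 1).

    sub = 0 performs X + Y, sub = 1 performs X - Y.
    Returns (result_pattern, carry_out, overflow).
    """
    mask = (1 << n) - 1
    y_in = (y ^ (mask if sub else 0)) & mask   # XOR gates: invert Y only when sub = 1
    c = sub                                    # Add/Sub line also drives c0
    c_prev = 0
    result = 0
    for i in range(n):
        xi = (x >> i) & 1
        yi = (y_in >> i) & 1
        c_prev = c                             # carry into stage i
        si = xi ^ yi ^ c
        c = (xi & yi) | (xi & c) | (yi & c)    # carry out of stage i
        result |= si << i
    overflow = c ^ c_prev                      # c_n XOR c_{n-1}
    return result, c, overflow


def signed(pattern, n=4):
    """Interpret an n-bit pattern as a 2's-complement value."""
    return pattern - (1 << n) if (pattern >> (n - 1)) & 1 else pattern


r, c_out, ov = add_sub(0b0101, 0b0011, sub=1)   # 5 - 3
print(signed(r), c_out, ov)                     # 2 1 0
r, c_out, ov = add_sub(0b0111, 0b0110, sub=0)   # 7 + 6 does not fit in 4 signed bits
print(signed(r), c_out, ov)                     # -3 0 1
```

In the second call the overflow flag is 1, so the 4-bit pattern must not be taken as the sum.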

FIG-3: Adder/Subtractor network.


The design of an adder/subtractor circuit can be illustrated as follows. Consider the parallel adder shown in Figure 4a. For subtraction, the B inputs are complemented and added to the A inputs along with a carry-in of 1. For addition, the A and B inputs are added unchanged, with no carry-in. Figure 4b indicates the necessary arrangement for an adder/subtractor. The task is to design the combinational circuit that receives the A, B and S inputs and produces the inputs for the full adder. Using K-map simplification of the truth table, the equations for Xi, Yi and Cin are obtained as indicated. Implementing these equations results in Figure 3.
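FIG-4a: Parallel adder with inputs A and B, computing S = A + B' + 1 (Cin = 1).

FIG-4b: Combinational circuit with inputs S, Ai and Bi, producing Xi and Yi for the full adder (FA) with carry-in Ci, output Fi and carry-out Ci+1.

FIG-4: Design of Adder/Subtractor.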

Truth Table:

S    Xi    Yi     Cin
0    Ai    Bi     0
1    Ai    Bi'    1

S    Ai   Bi   |   Xi   Yi
0    0    0    |   0    0
0    0    1    |   0    1
0    1    0    |   1    0
0    1    1    |   1    1
1    0    0    |   0    1
1    0    1    |   0    0
1    1    0    |   1    1
1    1    1    |   1    0

$X_i = A_i$,  $Y_i = B_i \oplus S$,  $C_{in} = S$

For addition Cin = 0, and for subtraction Cin = 1 along with the complement of B. Hence, the Add/Sub control line is connected to Cin. When the Add/Sub control line is set to 1, the Y vector is 1's-complemented (that is, bit-complemented) by the XOR gates and c0 is set to 1 to complete the 2's-complementation of Y. Remember that 2's-complementing a negative number is done in exactly the same manner as for a positive number. An XOR gate can be added to Figure 3 to detect the overflow condition $c_n \oplus c_{n-1}$. As listed in the truth table, Yi is equal to Bi when S = 0 and to the complement of Bi when S = 1. Using K-maps, the expressions for Xi and Yi can be obtained and implemented as in Figure 3.

Design of Fast Adders:


In an n-bit parallel adder (ripple-carry adder), there is too much delay in developing the outputs, s0 through sn-1 and cn. On many occasions this delay is not acceptable in comparison with the speed of other processor components and the speed of data transfer between registers and cache memories. The delay through a network depends on the integrated circuit technology used in fabricating the network and on the number of gates in the paths from inputs to outputs (propagation delay). The delay through any combinational logic network constructed from gates in a particular technology is determined by adding up the number of logic-gate delays along the longest signal propagation path through the network. In the case of the n-bit ripple-carry adder, the longest path is from inputs x0, y0 and c0 at the least-significant-bit (LSB) position to outputs cn and sn-1 at the most-significant-bit (MSB) position.

Using the logic implementation indicated in Figure-1, cn-1 is available in 2(n-1) gate delays, and sn-1 is one XOR gate delay later. The final carry-out, cn, is available after 2n gate delays. Therefore, if a ripple-carry adder is used to implement the addition/subtraction unit shown in Figure-3, all sum bits are available in 2n gate delays, including the delay through the XOR gates on the Y input. Using the implementation $c_n \oplus c_{n-1}$ for overflow, this indicator is available after 2n + 2 gate delays.

In summary, a stage of a parallel adder cannot complete its addition before all the previous stages have completed theirs, even when its own input bits are ready, because the carry bit from the previous stage must be available first. In practice, a number of design techniques have been used to implement high-speed adders. To reduce the delay, an augmented logic gate network structure may be used. One such method is to use circuit designs for fast propagation of carry signals (carry prediction).

Carry-Look ahead Addition:


As is clear from the previous discussion, a ripple-carry parallel adder is slow, and a fast adder circuit must speed up the generation of the carry signals. The carry input to each stage should become available almost as soon as the input bits are. This can be achieved by determining, directly from the input bits and the initial carry, whether each stage generates a carry or propagates an incoming one. The logic expressions for si (sum) and ci+1 (carry-out) of stage i are

$s_i = x_i \oplus y_i \oplus c_i$

$c_{i+1} = x_i y_i + x_i c_i + y_i c_i$

Factoring the second expression gives

$c_{i+1} = x_i y_i + (x_i + y_i)\, c_i = G_i + P_i c_i$, where $G_i = x_i y_i$ and $P_i = x_i + y_i$

The expressions Gi and Pi are called the carry generate and propagate functions for stage i. If the generate function for stage i is equal to 1, then ci+1 = 1, independent of the input carry ci. This occurs when both xi and yi are 1. The propagate function means that an input carry will produce an output carry when either xi or yi or both are equal to 1. Using the Gi and Pi functions, we can determine the carry for the ith stage even before the previous stages have completed their addition operations. All Gi and Pi functions can be formed independently and in parallel, in only one gate delay after the X and Y inputs are applied to an n-bit adder. Each bit stage contains an AND gate to form Gi, an OR gate to form Pi and a three-input XOR gate to form si. However, a simpler circuit can be derived by taking the propagate function as $P_i = x_i \oplus y_i$, which differs from $P_i = x_i + y_i$ only when xi = yi = 1, where Gi = 1 (so it does not matter whether Pi is 0 or 1). Then the basic cell in Figure-5 can be used in each bit stage to predict the carry ahead of any stage completing its addition. Consider the ci+1 expression:

$c_{i+1} = G_i + P_i c_i = G_i + P_i G_{i-1} + P_i P_{i-1} c_{i-1}$

This is because $c_i = G_{i-1} + P_{i-1} c_{i-1}$. Further, $c_{i-1} = G_{i-2} + P_{i-2} c_{i-2}$, and so on. Expanding in this fashion, the final carry expression can be written as

$c_{i+1} = G_i + P_i G_{i-1} + P_i P_{i-1} G_{i-2} + \cdots + P_i P_{i-1} \cdots P_1 G_0 + P_i P_{i-1} \cdots P_0 c_0$

Thus, all carries can be obtained in three gate delays after the input signals X, Y and c0 are applied. This is because only one gate delay is needed to develop all Pi and Gi signals, followed by two gate delays in the AND-OR circuit (SOP expression) for ci+1. After a further XOR gate delay, all sum bits are available. Therefore, independent of n, the number of stages, the n-bit addition process requires only four gate delays.
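The expanded carry expression can be checked with a short Python sketch. The point of the example is that each carry is computed directly from the G, P and c0 signals as a sum of products, never from the previous carry. It is a behavioural model under the assumption P_i = x_i XOR y_i, and the function name carry_lookahead_add is chosen here for illustration.

```python
def carry_lookahead_add(x, y, c0=0, n=4):
    """Add two n-bit unsigned patterns using the expanded carry expression.

    Returns (sum_pattern, carry_out).
    """
    xb = [(x >> i) & 1 for i in range(n)]
    yb = [(y >> i) & 1 for i in range(n)]
    G = [a & b for a, b in zip(xb, yb)]   # generate:  G_i = x_i y_i
    P = [a ^ b for a, b in zip(xb, yb)]   # propagate: P_i = x_i XOR y_i

    carries = [c0]
    for i in range(n):
        # c_{i+1} = G_i + P_i G_{i-1} + ... + P_i..P_1 G_0 + P_i..P_0 c_0
        c_next = 0
        for j in range(i, -1, -1):
            term = G[j]
            for k in range(j + 1, i + 1):
                term &= P[k]              # AND gate: G_j with P_{j+1} .. P_i
            c_next |= term                # OR gate collecting the product terms
        term = c0
        for k in range(i + 1):
            term &= P[k]                  # last product term: P_i .. P_0 c_0
        c_next |= term
        carries.append(c_next)

    s = sum((P[i] ^ carries[i]) << i for i in range(n))   # s_i = P_i XOR c_i
    return s, carries[n]


print(carry_lookahead_add(7, 6))   # (13, 0)
```

The widest AND term for c_{i+1} has i + 2 inputs, which is the fan-in growth that limits how far a single level of lookahead can be extended.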

FIG-5: 4 bit carry look ahead adder.


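Now, consider the design of a 4-bit parallel adder. The carries can be implemented as

$c_1 = G_0 + P_0 c_0$ ; i = 0
$c_2 = G_1 + P_1 G_0 + P_1 P_0 c_0$ ; i = 1
$c_3 = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 c_0$ ; i = 2
$c_4 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 c_0$ ; i = 3
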
The complete 4-bit adder is shown in Figure 5b, where the B cell indicates the Gi, Pi and si generator. The carries are implemented in the block labeled carry look-ahead logic. An adder implemented in this form is called a carry look-ahead adder. Delay through the adder is 3 gate delays for all carry bits and 4 gate delays for all sum bits. In comparison, a 4-bit ripple-carry adder requires 7 gate delays (2n - 1) for s3 and 8 gate delays (2n) for c4.

If we try to extend the carry look-ahead adder of Figure 5b to longer operands, we run into the problem of gate fan-in constraints. From the final expression for ci+1 and the carry expressions for a 4-bit adder, we see that the last AND gate and the OR gate require a fan-in of i + 2 in generating ci+1. For c4 (i = 3) in the 4-bit adder, a fan-in of 5 is required. This puts a limit on the practical implementation, so the adder design shown in Figure 5b cannot be directly extended to longer operand sizes. However, if we cascade a number of 4-bit adders, it is possible to build longer adders without the practical problems of fan-in. An example of a 16-bit carry look-ahead adder is shown in Figure 6. Eight 4-bit carry look-ahead adders can be connected as in Figure-2 to form a 32-bit adder.

FIG-6: 16 bit carry-look ahead adder.

MULTIPLICATION OF POSITIVE NUMBERS:


Consider the multiplication of two integers as in Figure-6a in binary number system. This algorithm applies to unsigned numbers and to positive signed numbers. The product of two n-digit numbers can be accommodated in 2n digits, so the product of the two 4-bit numbers in this example fits into 8 bits. In the binary system, multiplication by the multiplier bit 1 means the multiplicand is entered in the appropriate position to be added to the partial product. If the multiplier bit is 0, then 0s are entered, as indicated in the third row of the shown example.

            1 1 0 1     (13) Multiplicand M
          x 1 0 1 1     (11) Multiplier Q
          ---------
            1 1 0 1
          1 1 0 1
        0 0 0 0
      1 1 0 1
    ---------------
    1 0 0 0 1 1 1 1     (143) Product P

Binary multiplication of positive operands can be implemented in a combinational (speed-up) two-dimensional logic array, as shown in Figure 7. Here, M indicates the multiplicand, Q the multiplier and P the partial product. The basic component in each cell is a full adder, FA. The AND gate in each cell determines whether a multiplicand bit mj is added to the incoming partial-product bit, based on the value of the multiplier bit qi. For i in the range 0 to 3, if qi = 1, the multiplicand (appropriately shifted) is added to the incoming partial product, PPi, to generate the outgoing partial product, PP(i + 1); if qi = 0, PPi is passed vertically downward unchanged. The initial partial product PP0 is all 0s, and PP4 is the desired product. The multiplicand is shifted left one position per row by the diagonal signal path. Since the multiplicand is shifted and added to the partial product depending on the multiplier bit, the method is referred to as the SHIFT & ADD method. The multiplier array and the components of each bit cell are indicated in the diagram, while the flow diagram explains the multiplication procedure.
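The row-by-row behaviour of the array can be imitated in software. The sketch below is a functional model only (it does not model individual cells or gate delays), and the name shift_and_add is chosen for this example. Each loop iteration plays the role of one row of the array: it either adds the shifted multiplicand to the partial product or passes the partial product through.

```python
def shift_and_add(m, q, n=4):
    """Multiply two n-bit unsigned numbers by accumulating partial products."""
    pp = 0                         # PP0 is all 0s
    for i in range(n):
        if (q >> i) & 1:           # multiplier bit q_i selects Add
            pp += m << i           # multiplicand shifted left i positions
        # when q_i = 0 the partial product passes down unchanged
    return pp                      # PPn is the product


print(format(shift_and_add(0b1101, 0b1011), "08b"))   # 10001111
```

The product of two n-bit numbers fits in 2n bits, which is why the 4-bit example is printed with 8 digits.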

FIG-7a

P7, P6, P5, ..., P0: product.

FIG-7b
The following SHIFT & ADD method flow chart depicts the multiplication logic for unsigned numbers.


Despite the use of a combinational network, there is a considerable amount of delay associated with the arrangement shown. Although the preceding combinational multiplier is easy to understand, it uses many gates for multiplying numbers of practical size, such as 32- or 64-bit numbers. The worst-case signal propagation delay path is from the upper right corner of the array to the high-order product bit output at the bottom left corner of the array. The path includes the two cells at the right end of each row, followed by all the cells in the bottom row. Assuming that there are two gate delays from the inputs to the outputs of a full adder block, the path has a total of 6(n - 1) - 1 gate delays, including the initial AND gate delay in all cells, for the n x n array. The delay expression contains (n - 1) rather than n because only the AND gates are actually needed in the first row of the array, since the incoming (initial) partial product PP0 is zero.

Multiplication can also be performed using a mixture of combinational array techniques (similar to those shown in Figure 7) and sequential techniques requiring less combinational logic. Multiplication is usually provided as an instruction in the machine instruction set of a processor. High-performance processor chips (such as DSP processors) use an appreciable area of the chip to perform arithmetic functions on both integer and floating-point operands. Sacrificing on-chip area for these arithmetic circuits increases the speed of processing. Generally, processors built for real-time applications have an on-chip multiplier.

FIG-8a

A simpler way to perform multiplication is to use the adder circuitry in the ALU for a number of sequential steps. The block diagram in Figure 8a shows the hardware arrangement for sequential multiplication. This circuit performs multiplication by using a single n-bit adder n times to implement the spatial addition performed by the n rows of ripple-carry adders. Registers A and Q combine to hold PPi, while multiplier bit qi generates the signal Add/No-add. This signal controls the addition of the multiplicand M to PPi to generate PP(i + 1). The product is computed in n cycles. The partial product grows in length by one bit per cycle from the initial vector, PP0, of n 0s in register A. The carry-out from the adder is stored in flip-flop C.

To begin with, the multiplier is loaded into register Q, the multiplicand into register M, and registers C and A are cleared to 0. At the end of each cycle, C, A and Q are shifted right one bit position to allow for growth of the partial product as the multiplier is shifted out of register Q. Because of this shifting, multiplier bit qi appears at the LSB position of Q to generate the Add/No-add signal at the correct time, starting with q0 during the first cycle, q1 during the second cycle, and so on. After they are used, the multiplier bits are discarded by the right-shift operation. Note that the carry-out from the adder is the leftmost bit of PP(i + 1), and it must be held in the C flip-flop to be shifted right with the contents of A and Q. After n cycles, the high-order half of the product is held in register A and the low-order half is in register Q. The multiplication example used above is shown in Figure 8b as it would be performed by this hardware arrangement.

M = 1 1 0 1

                        C    A          Q
Initial configuration   0    0 0 0 0    1 0 1 1
Add                     0    1 1 0 1    1 0 1 1    I cycle
Shift                   0    0 1 1 0    1 1 0 1
Add                     1    0 0 1 1    1 1 0 1    II cycle
Shift                   0    1 0 0 1    1 1 1 0
No Add                  0    1 0 0 1    1 1 1 0    III cycle
Shift                   0    0 1 0 0    1 1 1 1
Add                     1    0 0 0 1    1 1 1 1    IV cycle
Shift                   0    1 0 0 0    1 1 1 1

Product (A, Q) = 1 0 0 0 1 1 1 1

FIG-8b
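The register-level procedure of the sequential multiplier can be traced with the following Python sketch. It is an illustrative simulation under the assumption of unsigned n-bit operands; the function name sequential_multiply and the trace format are invented for this example. The variables c, a and q stand for flip-flop C and registers A and Q.

```python
def sequential_multiply(m, q, n=4):
    """Simulate the C, A, Q register multiplier for n-bit unsigned operands.

    Returns (product, trace), where trace holds (C, A, Q) after each cycle's shift.
    """
    mask = (1 << n) - 1
    c, a = 0, 0                                # C and A are cleared initially
    trace = []
    for _ in range(n):
        if q & 1:                              # q0 = 1: Add
            total = a + m
            c, a = total >> n, total & mask    # carry-out goes to flip-flop C
        # Shift C, A and Q right one bit position as a single long register
        q = ((a & 1) << (n - 1)) | (q >> 1)
        a = (c << (n - 1)) | (a >> 1)
        c = 0
        trace.append((c, a, q))
    return (a << n) | q, trace


product, trace = sequential_multiply(0b1101, 0b1011)
for c, a, q in trace:
    print(c, format(a, "04b"), format(q, "04b"))
print(format(product, "08b"))   # 10001111
```

For M = 1101 and Q = 1011 the values printed after each shift follow the Shift rows of the 4-bit example above.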

Using this sequential hardware structure, it is clear that a multiply instruction takes much more time to execute than an Add instruction, because the single adder is used once per multiplier bit over n cycles. Several techniques have been used to speed up multiplication: bit-pair recoding, carry-save addition, repeated addition, etc.

SIGNED-OPERAND MULTIPLICATION:
Multiplication of 2's-complement signed operands, generating a double-length product is still achieved by accumulating partial products by adding versions of the multiplicand as decided by the multiplier bits. First, consider the case of a positive multiplier and a negative multiplicand. When we add a negative multiplicand to a partial product, we must extend the sign-bit value of the multiplicand to the left as far as the product will extend. In Figure 9, for example, the 5-bit signed operand, - 13, is the multiplicand, and +11, is the 5 bit multiplier & the expected product -143 is 10-bit wide. The sign extension of the multiplicand is shown in red color. Thus, the hardware discussed earlier can be used for negative multiplicands if it provides for sign extension of the partial products.

              1 0 0 1 1     (-13)
            x 0 1 0 1 1     (+11)
    -------------------
    1 1 1 1 1 1 0 0 1 1
    1 1 1 1 1 0 0 1 1
    0 0 0 0 0 0 0 0
    1 1 1 0 0 1 1
    0 0 0 0 0 0
    -------------------
    1 1 0 1 1 1 0 0 0 1     (-143)

FIG-9

For a negative multiplier, a straightforward solution is to form the 2's-complement of both the multiplier and the multiplicand and proceed as in the case of a positive multiplier. This is possible because complementation of both operands does not change the value or the sign of the product. To handle both negative and positive multipliers uniformly, the Booth algorithm can be used.

Booth Algorithm
The Booth algorithm generates a 2n-bit product and treats both positive and negative 2's-complement n-bit operands uniformly. To understand this algorithm, consider a multiplication operation in which the multiplier is positive and has a single block of 1s, for example, 0011110 (+30). To derive the product in the normal standard procedure, we could add four appropriately shifted versions of the multiplicand. However, using the Booth algorithm, we can reduce the number of required operations by regarding this multiplier as the difference between the numbers 32 and 2, as shown below:

      0 1 0 0 0 0 0     (32)
    - 0 0 0 0 0 1 0     (2)
    ---------------
      0 0 1 1 1 1 0     (30)


This suggests that the product can be generated by adding $2^5$ times the multiplicand to the 2's-complement of $2^1$ times the multiplicand. For convenience, we can describe the sequence of required operations by recoding the preceding multiplier as 0 +1 0 0 0 -1 0. In general, in the Booth scheme, -1 times the shifted multiplicand is selected when moving from 0 to 1, and +1 times the shifted multiplicand is selected when moving from 1 to 0, as the multiplier is scanned from right to left.
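The recoding rule can be written as a one-line comparison of each multiplier bit with its right-hand neighbour. The sketch below assumes the multiplier is given as a string of 0s and 1s with the MSB first, and that an implied 0 lies to the right of the LSB; the names booth_recode and recoded_value are chosen for this illustration.

```python
def booth_recode(bits):
    """Booth-recode a 2's-complement multiplier given as a bit string, MSB first.

    Returns a list of digits from {-1, 0, +1}, MSB first.
    """
    padded = bits + "0"                        # implied 0 to the right of the LSB
    # digit = (bit to the right) - (this bit):
    #   scanning right to left, 0 -> 1 gives -1 and 1 -> 0 gives +1
    return [int(padded[i + 1]) - int(padded[i]) for i in range(len(bits))]


def recoded_value(digits):
    """Numeric value of a recoded multiplier (digits MSB first)."""
    n = len(digits)
    return sum(d << (n - 1 - i) for i, d in enumerate(digits))


digits = booth_recode("0011110")
print(digits)                  # [0, 1, 0, 0, 0, -1, 0]
print(recoded_value(digits))   # 30
```

The recoded digits reproduce 32 - 2 = 30, so only two summands are needed instead of four.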

                  0 1 0 1 1 0 1
                x 0 0 +1 +1 +1 +1 0
    ---------------------------
                  0 0 0 0 0 0 0
                0 1 0 1 1 0 1
              0 1 0 1 1 0 1
            0 1 0 1 1 0 1
          0 1 0 1 1 0 1
        0 0 0 0 0 0 0
      0 0 0 0 0 0 0
    ---------------------------
    0 0 0 1 0 1 0 1 0 0 0 1 1 0

FIG-10a: Normal Multiplication

                  0 1 0 1 1 0 1
                x 0 +1 0 0 0 -1 0
    ---------------------------
    0 0 0 0 0 0 0 0 0 0 0 0 0 0
    1 1 1 1 1 1 1 0 1 0 0 1 1        (2's complement of the multiplicand)
    0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0
    0 0 0 1 0 1 1 0 1
    0 0 0 0 0 0 0 0
    ---------------------------
    0 0 0 1 0 1 0 1 0 0 0 1 1 0

FIG-10b: Booth Multiplication

Figure 10 illustrates the normal and the Booth algorithms for the said example. The Booth algorithm clearly extends to any number of blocks of 1s in a multiplier, including the situation in which a single 1 is considered a block. See Figure 11a for another example of recoding a multiplier. The case when the least significant bit of the multiplier is 1 is handled by assuming that an implied 0 lies to its right. The Booth algorithm can also be used directly for negative multipliers, as shown in Figure 11a. To verify the correctness of the Booth algorithm for negative multipliers, we use the following property of negative-number representations in the 2's-complement system.

FIG-11a

FIG-11b


FIG - 12: Example of BOOTHs Algorithm.


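Suppose that the leftmost 0 of a negative number X is at bit position k, that is, X = 11...10 x_{k-1} ... x_0. Then the value of X is

$V(X) = -2^{k+1} + x_{k-1} \times 2^{k-1} + \cdots + x_0 \times 2^0$

To see this, write X as the sum of two numbers: a top number that has 1s in bit position k+1 and every position to its left and 0s in all lower positions, and a second number that has 0s down to position k followed by x_{k-1} ... x_0. Then the top number is the 2's-complement representation of $-2^{k+1}$. The recoded multiplier now consists of the part corresponding to the second number, with -1 added in position k+1. For example, the multiplier 110110 is recoded as 0 -1 +1 0 -1 0.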

The Booth technique for recoding multipliers is summarized in Figure 13a. The transformation 011...110 => +1 0 0...0 -1 0 is called skipping over 1s. This term is derived from the case in which the multiplier has its 1s grouped into a few contiguous blocks. Only a few versions of the shifted multiplicand (the summands) must be added to generate the product, thus speeding up the multiplication operation. However, in the worst case, that of alternating 1s and 0s in the multiplier, each bit of the multiplier selects a summand. In fact, this results in more summands than if the Booth algorithm were not used. A 16-bit worst-case multiplier, an ordinary multiplier, and a good multiplier are shown in Fig 13a. Fig 13b is the flow chart that explains the Booth algorithm.

The Booth algorithm has two attractive features. First, it handles both positive and negative multipliers uniformly. Second, it achieves some efficiency in the number of additions required when the multiplier has a few large blocks of 1s. The speed gained by skipping over 1s depends on the data. On average, the speed of doing multiplication with the Booth algorithm is the same as with the normal algorithm.
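A compact software version of Booth multiplication is sketched below. It is not a register-level model of the flow chart: it assumes the operands are ordinary Python integers that fit in n-bit 2's-complement form, and it relies on Python's arithmetic right shift to read the bits of a negative multiplier. The function name booth_multiply is chosen for this illustration.

```python
def booth_multiply(m, q, n):
    """Multiply two n-bit 2's-complement integers with Booth recoding.

    m and q are Python ints in the range -2**(n-1) .. 2**(n-1) - 1.
    """
    product = 0
    prev = 0                           # implied 0 to the right of the LSB
    for i in range(n):
        bit = (q >> i) & 1             # bit i of q in 2's-complement form
        digit = prev - bit             # +1 on a 1 -> 0 transition, -1 on 0 -> 1
        product += digit * (m << i)    # add or subtract the shifted multiplicand
        prev = bit
    return product


print(booth_multiply(-13, 11, 5))   # -143
print(booth_multiply(13, -6, 5))    # -78
```

Both positive and negative multipliers go through the same loop, which is the uniform treatment noted above.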

FIG 13a: Booth recoded multipliers.

FAST MULTIPLICATION:
There are two techniques for speeding up the multiplication operation. The first technique guarantees that the maximum number of summands (versions of the multiplicand) that must be added is n/2 for n-bit operands. The second technique reduces the time needed to add the summands (carry-save addition of summands method).


FIG 13b: Booth Algorithm Flow chart.

Bit-Pair Recoding of Multipliers:


This bit-pair recoding technique halves the maximum number of summands. It is derived from the Booth algorithm. Group the Booth-recoded multiplier bits in pairs, and observe the following: the pair (+1 -1) is equivalent to the pair (0 +1). That is, instead of adding -1 times the multiplicand M at shift position i to +1 x M at position i + 1, the same result is obtained by adding +1 x M at position i. Other examples are: (+1 0) is equivalent to (0 +2), (-1 +1) is equivalent to (0 -1), and so on. Thus, if the Booth-recoded multiplier is examined two bits at a time, starting from the right, it can be rewritten in a form that requires at most one version of the multiplicand to be added to the partial product for each pair of multiplier bits. Figure 14a shows an example of bit-pair recoding of the multiplier in Figure 11, and Figure 14b shows a table of the multiplicand selection decisions for all possibilities. The multiplication operation in Figure 11a is shown in Figure 15. It is clear from the example that the bit-pair recoding method requires only n/2 summands, as against n summands in Booth's algorithm.

FIG - 14
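Bit-pair recoding can be sketched by computing one digit per pair of multiplier bits directly from the original bits. The formula used below, digit = -2 b(i+1) + b(i) + b(i-1), is the usual way of combining two Booth digits into one and is stated here as an assumption rather than taken from Figure 14b. The function names are chosen for this example, and n is assumed to be even.

```python
def bit_pair_recode(q, n):
    """Recode an n-bit 2's-complement multiplier q (n even) two bits at a time.

    Returns digits from {-2, -1, 0, +1, +2}, least significant pair first.
    """
    digits = []
    prev = 0                                     # implied 0 to the right of the LSB
    for i in range(0, n, 2):
        b0 = (q >> i) & 1
        b1 = (q >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)       # one multiplicand version per pair
        prev = b1
    return digits


def bit_pair_multiply(m, q, n):
    """Multiply using at most n/2 summands, one per recoded digit."""
    return sum(d * (m << (2 * k)) for k, d in enumerate(bit_pair_recode(q, n)))


print(bit_pair_recode(-6, 6))        # [-2, -1, 0]
print(bit_pair_multiply(13, -6, 6))  # -78
```

Each digit selects 0, +1, -1, +2 or -2 times the multiplicand, so a 6-bit multiplier needs at most three summands.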


FIG 15: Multiplication requiring n/2 summands.

INTEGER DIVISION:
Multiplication of positive numbers was discussed by relating the way the operation is done manually to the way it is done in a logic circuit. A similar approach is used here in discussing integer division. First, consider positive-number division. Figure 16 shows an example of decimal division and the binary form of the same division. In the decimal example, we first try to divide 13 into 2, and it does not work. Next, we try to divide 13 into 27. Going through the trials, we enter 2 as the quotient digit and perform the required subtraction. The next digit of the dividend, 4, is brought down, and we finish by deciding that 13 goes into 14 once and the remainder is 1. Binary division is similar, with the quotient bits only 0 and 1.

A circuit that implements division by this longhand method operates as follows: it positions the divisor appropriately with respect to the dividend and performs a subtraction. If the remainder is zero or positive, a quotient bit of 1 is determined, the remainder is extended by another bit of the dividend, the divisor is repositioned, and subtraction is performed. On the other hand, if the remainder is negative, a quotient bit of 0 is determined, the dividend is restored by adding back the divisor, and the divisor is repositioned for another subtraction.

FIG - 16


FIG 17: Binary Division


FIG 18: Restoring Division

Restoring Division:
Figure 17 shows a logic circuit arrangement that implements restoring division. Note its similarity to the structure for multiplication that was shown in Figure 8. An n-bit positive divisor is loaded into register M and an n-bit positive dividend is loaded into register Q at the start of the operation. Register A is set to 0. After the division is complete, the n-bit quotient is in register Q and the remainder is in register A. The required subtractions are facilitated by using 2's-complement arithmetic. The extra bit position at the left end of both A and M accommodates the sign bit during subtractions.

The following algorithm performs restoring division. Do the following n times:
1. Shift A and Q left one binary position.
2. Subtract M from A, and place the answer back in A.
3. If the sign of A is 1, set q0 to 0 and add M back to A (that is, restore A); otherwise, set q0 to 1.

Figure 18 shows a 4-bit example as it would be processed by the circuit in Figure 17.
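The three steps of the restoring algorithm translate almost line for line into Python. The sketch below is a simplification: register A is held as a signed Python integer rather than as an (n+1)-bit 2's-complement register, so "the sign of A is 1" becomes the test a < 0. The function name restoring_divide is chosen for this example, and the divisor is assumed to be nonzero.

```python
def restoring_divide(dividend, divisor, n=4):
    """Restoring division of n-bit positive operands.

    Returns (quotient, remainder), i.e. the final contents of Q and A.
    """
    mask = (1 << n) - 1
    a, q, m = 0, dividend, divisor
    for _ in range(n):
        # 1. Shift A and Q left one binary position (MSB of Q moves into A)
        a = (a << 1) | ((q >> (n - 1)) & 1)
        q = (q << 1) & mask
        # 2. Subtract M from A
        a -= m
        # 3. Negative result: restore A and leave q0 = 0; otherwise set q0 = 1
        if a < 0:
            a += m
        else:
            q |= 1
    return q, a


print(restoring_divide(0b1000, 0b0011))   # (2, 2)
```

With dividend 1000 and divisor 0011 (values chosen for this illustration), the quotient is 0010 and the remainder is 0010.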

Non-restoring Division:
The restoring-division algorithm can be improved by avoiding the need for restoring A after an unsuccessful subtraction. Subtraction is said to be unsuccessful if the result is negative. Consider the sequence of operations that takes place after the subtraction operation in the preceding algorithm. If A is positive, we shift left and subtract M, that is, we perform 2A - M. If A is negative, we restore it by performing A + M, and then we shift it left and subtract M. This is equivalent to performing 2A + M. The q0 bit is appropriately set to 0 or 1 after the correct operation has been performed. We can summarize this in the following algorithm for non-restoring division.

Step 1: Do the following n times:
1. If the sign of A is 0, shift A and Q left one bit position and subtract M from A; otherwise, shift A and Q left and add M to A.
2. Now, if the sign of A is 0, set q0 to 1; otherwise, set q0 to 0.

Step 2: If the sign of A is 1, add M to A.

Step 2 is needed to leave the proper positive remainder in A at the end of the n cycles of Step 1. The logic circuitry in Figure 17 can also be used to perform this algorithm. Note that the Restore operations are no longer needed, and that exactly one Add or Subtract operation is performed per cycle. Figure 19 shows how the division example in Figure 18 is executed by the non-restoring division algorithm.

There are no simple algorithms for directly performing division on signed operands that are comparable to the algorithms for signed multiplication. In division, the operands can be preprocessed to transform them into positive values. After using one of the algorithms just discussed, the results are transformed to the correct signed values, as necessary.
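The two steps of the non-restoring algorithm can be sketched in the same style. As before, register A is modelled as a signed Python integer instead of an (n+1)-bit 2's-complement register, the function name nonrestoring_divide is chosen for this example, and the divisor is assumed to be nonzero.

```python
def nonrestoring_divide(dividend, divisor, n=4):
    """Non-restoring division of n-bit positive operands.

    Returns (quotient, remainder), i.e. the final contents of Q and A.
    """
    mask = (1 << n) - 1
    a, q, m = 0, dividend, divisor
    for _ in range(n):                           # Step 1, repeated n times
        was_negative = a < 0                     # sign of A before the shift
        a = (a << 1) | ((q >> (n - 1)) & 1)      # shift A and Q left
        q = (q << 1) & mask
        a = a + m if was_negative else a - m     # exactly one add or subtract
        if a >= 0:
            q |= 1                               # sign of A is 0: q0 = 1
    if a < 0:                                    # Step 2: final correction
        a += m
    return q, a


print(nonrestoring_divide(0b1000, 0b0011))   # (2, 2)
```

Each cycle performs a single addition or subtraction, and only the final correction in Step 2 may add M back.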

FIG 19: Non-restoring Division

Floating-Point Numbers and Operations:


Floating-point arithmetic is an automatic way to keep track of the radix point. The discussion so far has dealt exclusively with fixed-point numbers, which are considered as integers, that is, as having an implied binary point at the right end of the number. It is also possible to assume that the binary point is just to the right of the sign bit, thus representing a fraction, or anywhere else, resulting in real numbers. In the 2's-complement system, the signed value F, represented by the n-bit binary fraction $B = b_0 . b_{-1} b_{-2} \ldots b_{-(n-1)}$, is given by

$F(B) = -b_0 \times 2^0 + b_{-1} \times 2^{-1} + b_{-2} \times 2^{-2} + \cdots + b_{-(n-1)} \times 2^{-(n-1)}$

where the range of F is $-1 \le F \le 1 - 2^{-(n-1)}$.

Consider the range of values representable in a 32-bit, signed, fixed-point format. Interpreted as integers, the value range is approximately 0 to $\pm 2.15 \times 10^{9}$. If we consider them to be fractions, the range is approximately $\pm 4.66 \times 10^{-10}$ to $\pm 1$. Neither of these ranges is sufficient for scientific calculations, which might involve parameters like Avogadro's number ($6.0247 \times 10^{23}$ mole$^{-1}$) or Planck's constant ($6.6254 \times 10^{-27}$ erg s). Hence, we need to easily accommodate both very large integers and very small fractions. To do this, a computer must be able to represent numbers and operate on them in such a way that the position of the binary point is variable and is automatically adjusted as computation proceeds. In such a case, the binary point is said to float, and the numbers are called floating-point numbers. This distinguishes them from fixed-point numbers, whose binary point is always in the same position.

Because the position of the binary point in a floating-point number is variable, it must be given explicitly in the floating-point representation. For example, in the familiar decimal scientific notation, numbers may be written as $6.0247 \times 10^{23}$, $6.6254 \times 10^{-27}$, $-1.0341 \times 10^{2}$, $-7.3000 \times 10^{-14}$, and so on. These numbers are said to be given to five significant digits. The scale factors ($10^{23}$, $10^{-27}$, and so on) indicate the position of the decimal point with respect to the significant digits. By convention, when the decimal point is placed to the right of the first (nonzero) significant digit, the number is said to be normalized. Note that the base, 10, in the scale factor is fixed and does not need to appear explicitly in the machine representation of a floating-point number. The sign, the significant digits, and the exponent in the scale factor constitute the representation. We are thus motivated to define a floating-point number representation as one in which a number is represented by its sign, a string of significant digits, commonly called the mantissa, and an exponent to an implied base for the scale factor.

FLOATING-POINT DATA

Floating-point representation of numbers needs two registers. The first represents a signed fixed-point number and the second, the position of the radix point. For example, the representation of the decimal number +6132.789 is as follows:

The first register has a 0 in the most significant flip-flop position to denote a plus. The magnitude of the number is stored in a binary code in 28 flip-flops, with each decimal digit occupying 4 flip-flops. The number in the first register is considered a fraction, so the decimal point in the first register is fixed at the left of the most significant digit. The second register contains the decimal number +4 (in binary code) to indicate that the actual position of the decimal point is four decimal positions to the right. This representation is equivalent to the number expressed by a fraction times 10 to an exponent, i.e., +6132.789 is represented as $+.6132789 \times 10^{+4}$. Because of this analogy, the contents of the first register are called the coefficient (and sometimes mantissa or fractional part) and the contents of the second register are called the exponent (or characteristic). The position of the actual decimal point may be outside the range of digits of the

coefficient register. For example, assuming sign-magnitude representation, the following contents:

[Figure: coefficient register and exponent register contents]

represent the number $+.2601000 \times 10^{-4} = +.00002601000$, which produces four more 0's on the left. On the other hand, the following contents:

[Figure: coefficient register and exponent register contents]

represent the number $-.2601000 \times 10^{+12} = -260100000000$, which produces five more 0's on the right.

In these examples, we have assumed that the coefficient is a fixed-point fraction. Some computers assume it to be an integer, so the initial decimal point in the coefficient register is to the right of the least significant digit. Another arrangement used for the exponent is to remove its sign bit altogether and consider the exponent as being "biased." For example, numbers between $10^{-50}$ and $10^{+49}$ can be represented with an exponent of two digits (without sign bit) and a bias of 50. The exponent register always contains the number E + 50, where E is the actual exponent. The subtraction of 50 from the contents of the register gives the desired exponent. This way, positive exponents are represented in the register in the range of numbers from 50 to 99. The subtraction of 50 gives the positive values from 00 to 49. Negative exponents are represented in the register in the range of 00 to 49. The subtraction of 50 gives the negative values in the range of -50 to -1.

A floating-point binary number is similarly represented with two registers, one to store the coefficient and the other, the exponent. For example, the number +1001.110 can be represented as follows:
Coefficient: 0 .100111000 (sign bit 0; the initial binary point is immediately to the right of the sign bit)
Exponent: 0 0100 (sign bit 0; magnitude 4)


The coefficient register has ten flip-flops: one for sign and nine for magnitude. Assuming that the coefficient is a fixed-point fraction, the actual binary point is four positions to the right, so the exponent has the binary value +4. The number is represented in binary as $.100111000 \times 10^{100}$ (remember that $10^{100}$ in binary is equivalent to decimal $2^{4}$). Floating-point is always interpreted to represent a number in the following form:

$c \cdot r^{e}$

where c represents the contents of the coefficient register and e, the contents of the exponent register. The radix (base) r and the radix-point position in the coefficient are always assumed. Consider, for example, a computer that assumes integer representation for the coefficient and base 8 for the exponent. The octal number $+17.32 = +1732 \times 8^{-2}$ will look like this:

When the octal representation is converted to binary, the binary value of the registers becomes:

Coefficient: 0 001111011010
Exponent: 1 000010
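The interpretation c · r^e can be checked numerically. The following is a minimal sketch, not part of the original notes; the helper name `float_value` is hypothetical, and exact rational arithmetic is used only to avoid rounding in the illustration.

```python
from fractions import Fraction

def float_value(coefficient, exponent: int, radix: int) -> Fraction:
    """Return c * r**e exactly (c may be an int or a Fraction)."""
    return coefficient * Fraction(radix) ** exponent

# Octal +17.32: integer coefficient 1732 (octal), exponent -2, radix 8.
octal_example = float_value(int("1732", 8), -2, 8)
print(float(octal_example))   # 15.40625

# Binary +1001.110: fraction .100111000 (nine bits), exponent +4, radix 2.
binary_fraction = Fraction(int("100111000", 2), 2 ** 9)
binary_example = float_value(binary_fraction, 4, 2)
print(float(binary_example))  # 9.75
```

Both results agree with a direct conversion: octal 17.32 is 15 + 26/64, and binary 1001.11 is 9 + 3/4.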

A floating-point number is said to be normalized if the most significant position of the coefficient contains a nonzero digit. In this way, the coefficient has no leading zeros and contains the maximum possible number of significant digits. Consider, for example, a coefficient register that can accommodate five decimal digits and a sign. The number $+.00357 \times 10^{3} = 3.57$ is not normalized because it has two leading zeros, and the unnormalized coefficient is accurate to only three significant digits. The number can be normalized by shifting the coefficient two positions to the left and decreasing the exponent by 2 to obtain $+.35700 \times 10^{1} = 3.5700$, which is accurate to five significant digits.
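The normalization step just described can be sketched as follows. This is an illustrative sketch only: the function name `normalize` and the choice to hold the coefficient as a digit string are assumptions, not taken from the notes.

```python
def normalize(digits: str, exponent: int) -> tuple[str, int]:
    """Normalize the decimal fraction .d1d2...dn x 10**exponent.

    `digits` holds the coefficient digits to the right of the implied point.
    Leading zeros are shifted out to the left, zeros are shifted in on the
    right, and the exponent is decreased by the number of positions shifted.
    """
    if set(digits) <= {"0"}:
        return digits, exponent            # zero has no normalized form
    shift = len(digits) - len(digits.lstrip("0"))
    return digits[shift:] + "0" * shift, exponent - shift

print(normalize("00357", 3))  # ('35700', 1)
```

The register width is preserved, so the two positions freed on the right are available for additional significant digits in later computation.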

Arithmetic operations with floating-point number representation are more complicated than arithmetic operations with fixed-point numbers; their execution takes longer and requires more complex hardware. However, floating-point representation is more convenient because it avoids the scaling problems involved with fixed-point operations. Many computers have a built-in capability to perform floating-point arithmetic operations. Those that do not have this hardware are usually programmed to operate in this mode. Adding or subtracting two numbers in floating-point representation first requires an alignment of the radix point, since the exponent parts must be made equal before the coefficients are added or subtracted. This alignment is done by shifting one coefficient while its exponent is adjusted until it is equal to the other exponent. Floating-point multiplication or division requires no alignment of the radix point. The product can be formed by multiplying the two coefficients and adding the two exponents. Division is accomplished by dividing the coefficients and subtracting the exponent of the divisor from the exponent of the dividend.

IEEE Standard for Floating-Point Numbers:


We start with a general form and size for floating-point numbers in the decimal system and then relate this form to a comparable binary representation. A useful form is $\pm X_1.X_2X_3X_4X_5X_6X_7 \times 10^{\pm Y_1Y_2}$, where $X_i$ and $Y_i$ are decimal digits. Both the number of significant digits (7) and the exponent range ($\pm 99$) are sufficient for a wide range of scientific calculations. It is possible to approximate this mantissa precision and scale factor range in a binary representation that occupies 32 bits, which is a standard computer word length. A 24-bit mantissa can approximately represent a 7-digit decimal number, and an 8-bit exponent to an implied base of 2 provides a scale factor with a reasonable range. One bit is needed for the sign of the number. Since the leading nonzero bit of a normalized binary mantissa must be a 1, it does not have to be included explicitly in the representation. Therefore, a total of 32 bits is needed. This standard for representing floating-point numbers in 32 bits has been developed and specified in detail by the Institute of Electrical and Electronics Engineers (IEEE) [1]. The standard describes both the representation and the way in which the four basic arithmetic operations are to be performed. The 32-bit representation is given in Figure 20a. The sign of the number is given in the first bit, followed by a representation for the exponent (to the base 2) of the scale factor. Instead of the signed exponent, E, the value actually stored in the exponent field is an

unsigned integer E' = E + 127.

Fig 20
This is called the excess-127 format. Thus, E' is in the range $0 \le E' \le 255$. The end values of this range, 0 and 255, are used to represent special values, as described below. Therefore, the range of E' for normal values is $1 \le E' \le 254$. This means that the actual exponent, E, is in the range $-126 \le E \le 127$. The excess-x representation for exponents enables efficient comparison of the relative sizes of two floating-point numbers. The last 23 bits represent the mantissa. Since binary normalization is used, the most significant bit of the mantissa is always equal to 1. This bit is not explicitly represented: it is assumed to be to the immediate left of the binary point. Hence, the 23 bits stored in the M field actually represent the fractional part of the mantissa, that is, the bits to the right of the binary point. An example of a single-precision floating-point number is shown in Figure 20. The 32-bit standard representation in Figure 20a is called a single-precision representation because it occupies a single 32-bit word. The scale factor has a range of $2^{-126}$ to $2^{+127}$, which is approximately equal to $10^{\pm 38}$. The 24-bit mantissa provides approximately the same precision as a 7-digit decimal value.

To provide more precision and range for floating-point numbers, the IEEE standard also specifies a double-precision format, as shown in Figure 20. The double-precision format has increased exponent and mantissa ranges. The 11-bit excess-1023 exponent E' has the range $1 \le E' \le 2046$ for normal values, with 0 and 2047 used to indicate special values, as before. Thus, the actual exponent E is in the range $-1022 \le E \le 1023$, providing scale factors of $2^{-1022}$ to $2^{1023}$ (approximately $10^{\pm 308}$). The 53-bit mantissa provides a precision equivalent to about 16 decimal digits. A computer must provide at least single-precision representation to conform to the IEEE standard. Double-precision representation is optional. The standard also specifies certain optional extended versions of both of these formats. The extended versions are intended to provide increased precision and increased exponent range for the representation of intermediate values in a sequence of calculations. For example, the dot product of two vectors of numbers can be computed by accumulating the sum of products in extended precision. The inputs are given in a standard precision, either single or double, and the answer is truncated to the same precision. The use of extended formats helps to reduce the size of the accumulated round-off error in a sequence of calculations. Extended formats also enhance the accuracy of evaluation of elementary functions such as sine, cosine, and so on. In addition to requiring the four basic arithmetic operations, the standard requires that the operations of remainder, square root, and conversion between binary and decimal representations be provided. We note two basic aspects of operating with floating-point numbers. First, if a number is not normalized, it can always be put in normalized form by shifting the fraction and adjusting the exponent.
Figure 21 shows an unnormalized value, $0.0010110\ldots \times 2^{9}$, and its normalized version, $1.0110\ldots \times 2^{6}$. Since the scale factor is in the form $2^{i}$, shifting the mantissa right or left by one bit position is compensated by an increase or a decrease of 1 in the exponent, respectively. Second, as computations proceed, a number that does not fall in the representable range of normal numbers might be generated. In single precision, this means that its normalized representation requires an exponent less than -126 or greater than +127. In the first case, we say that underflow has occurred, and in the second case, we say that overflow has occurred. Both underflow and overflow are arithmetic exceptions that are considered below.
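The single-precision field layout (sign, excess-127 exponent E', 23-bit fraction M) can be inspected with Python's standard `struct` module. This is a minimal sketch for illustration; the function names `decode_single` and `value_of_normal` are hypothetical.

```python
import struct

def decode_single(x: float) -> tuple[int, int, int]:
    """Return (sign S, biased exponent E', 23-bit fraction M) of x as an IEEE single."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    e_biased = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, e_biased, fraction

def value_of_normal(sign: int, e_biased: int, fraction: int) -> float:
    """Rebuild a normal value as (-1)^S x 1.M x 2^(E' - 127)."""
    assert 1 <= e_biased <= 254, "only normal values are handled here"
    mantissa = 1 + fraction / 2 ** 23      # restore the implied leading 1
    return (-1) ** sign * mantissa * 2.0 ** (e_biased - 127)

s, e, m = decode_single(-6.5)              # -6.5 = -1.625 x 2^2
print(s, e, m)                             # 1 129 5242880
print(value_of_normal(s, e, m))            # -6.5
```

Here E' = 129 corresponds to the true exponent E = 129 - 127 = 2, and M = 5242880 is 0.625 scaled by 2^23.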


Fig 21
Special Values
The end values 0 and 255 of the excess-127 exponent E' are used to represent special values. When E' = 0 and the mantissa fraction M is zero, the value exact 0 is represented. When E' = 255 and M = 0, the value $\infty$ is represented, where $\infty$ is the result of dividing a normal number by zero. The sign bit is still part of these representations, so there are $\pm 0$ and $\pm\infty$ representations. When E' = 0 and M $\ne$ 0, denormal numbers are represented. Their value is $\pm 0.M \times 2^{-126}$. Therefore, they are smaller than the smallest normal number. There is no implied one to the left of the binary point, and M is any nonzero 23-bit fraction. The purpose of introducing denormal numbers is to allow for gradual underflow, providing an extension of the range of normal representable numbers that is useful in dealing with very small numbers in certain situations. When E' = 255 and M $\ne$ 0, the value represented is called Not a Number (NaN). A NaN is the result of performing an invalid operation such as 0/0 or $\sqrt{-1}$.
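The special-value encodings above depend only on the E' and M fields, so they can be classified directly from a 32-bit pattern. The sketch below is illustrative; the function names are hypothetical and the bit patterns are chosen only as examples.

```python
def classify_single(bits: int) -> str:
    """Classify a 32-bit single-precision pattern by its E' and M fields."""
    e_biased = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if e_biased == 0:
        return "zero" if fraction == 0 else "denormal"
    if e_biased == 255:
        return "infinity" if fraction == 0 else "NaN"
    return "normal"

def denormal_value(bits: int) -> float:
    """Value of a denormal: (-1)^S x 0.M x 2^-126 (no implied leading 1)."""
    sign = -1.0 if bits >> 31 else 1.0
    return sign * (bits & 0x7FFFFF) / 2 ** 23 * 2.0 ** -126

print(classify_single(0x80000000))  # zero (the sign bit makes this -0)
print(classify_single(0x7F800000))  # infinity
print(classify_single(0x7FC00000))  # NaN
print(classify_single(0x00000001))  # denormal
```

The smallest positive denormal, pattern 0x00000001, has the value 2^-149, well below the smallest normal number 2^-126.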

Exceptions
In conforming to the IEEE Standard, a processor must set exception flags if any of the following occur in performing operations: underflow, overflow, divide by zero, inexact, and invalid. We have already mentioned the first three. Inexact is the name for a result that requires rounding in order to be represented in one of the normal

formats. An invalid exception occurs if operations such as 0/0 or $\sqrt{-1}$ are attempted. When exceptions occur, the results are set to special values. If interrupts are enabled for any of the exception flags, system or user-defined routines are entered when the associated exception occurs. Alternatively, the application program can test for the occurrence of exceptions, as necessary, and decide how to proceed.

Arithmetic Operations on Floating-Point Numbers:


The rules apply to the single-precision IEEE standard format. These rules specify only the major steps needed to perform the four operations. Intermediate results for both mantissas and exponents might require more than 24 and 8 bits, respectively, and overflow or underflow may occur. These and other aspects of the operations must be carefully considered in designing an arithmetic unit that meets the standard. If their exponents differ, the mantissas of floating-point numbers must be shifted with respect to each other before they are added or subtracted. Consider a decimal example in which we wish to add $2.9400 \times 10^{2}$ to $4.3100 \times 10^{4}$. We rewrite $2.9400 \times 10^{2}$ as $0.0294 \times 10^{4}$ and then perform addition of the mantissas to get $4.3394 \times 10^{4}$. The rule for addition and subtraction can be stated as follows:

Add/Subtract Rule
The steps in addition (FA) or subtraction (FS) of floating-point numbers $(s_1, e_1, f_1)$ and $(s_2, e_2, f_2)$ are as follows.

1. Unpack sign, exponent, and fraction fields. Handle special operands such as zero, infinity, or NaN (not a number).
2. Shift the significand of the number with the smaller exponent right by $|e_1 - e_2|$ bits.
3. Set the result exponent $e_r$ to $\max(e_1, e_2)$.
4. If the instruction is FA and $s_1 = s_2$, or if the instruction is FS and $s_1 \ne s_2$, then add the significands; otherwise subtract them.
5. Count the number z of leading zeros. A carry can make z = -1. Shift the result significand left z bits, or right 1 bit if z = -1.
6. Round the result significand, and shift right and adjust z if there is rounding overflow, which is a carry-out of the leftmost digit upon rounding.
7. Adjust the result exponent by $e_r = e_r - z$, check for overflow or underflow, and pack the result sign, biased exponent, and fraction bits into the result word.


[Worked examples (decimal): operands $6.144 \times 10^{2} + 9.975 \times 10^{4}$, and operands $1.076 \times 10^{-7} - 9.987 \times 10^{-8}$]
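The Add/Subtract rule can be traced on the two decimal operand pairs above. The following is a minimal sketch under stated assumptions: it works in decimal with 4-digit significands of the form d.ddd, uses exact fractions in place of guard digits, rounds to nearest, and omits special operands and exponent range checks. The name `fp_add_sub` and the tuple layout (sign, exponent, significand) are hypothetical.

```python
from fractions import Fraction

P = 4  # significand digits d.ddd (assumed, to match the 4-digit examples)

def fp_add_sub(a, b, subtract=False):
    """Add or subtract numbers given as (sign, exp, sig): (-1)**sign * sig * 10**exp."""
    (s1, e1, f1), (s2, e2, f2) = a, b          # step 1: unpack
    if subtract:
        s2 ^= 1                                # FS is FA with the second sign flipped
    if e1 >= e2:                               # step 2: shift the smaller operand right
        f2 = f2 / 10 ** (e1 - e2)
    else:
        f1 = f1 / 10 ** (e2 - e1)
    er = max(e1, e2)                           # step 3: tentative result exponent
    # step 4: equal signs add magnitudes, unequal signs subtract them
    total = (-f1 if s1 else f1) + (-f2 if s2 else f2)
    sr, fr = (1, -total) if total < 0 else (0, total)
    if fr == 0:
        return (0, 0, Fraction(0))
    while fr >= 10:                            # step 5: carry, one right shift (z = -1)
        fr, er = fr / 10, er + 1
    while fr < 1:                              # step 5: shift left over leading zeros
        fr, er = fr * 10, er - 1
    scale = 10 ** (P - 1)                      # step 6: round to P digits
    fr = Fraction(round(fr * scale), scale)
    if fr >= 10:                               # rounding overflow: shift right once more
        fr, er = fr / 10, er + 1
    return (sr, er, fr)                        # step 7: pack (range checks omitted)

x = fp_add_sub((0, 2, Fraction("6.144")), (0, 4, Fraction("9.975")))
print(x[1], float(x[2]))   # 5 1.004
y = fp_add_sub((0, -7, Fraction("1.076")), (0, -8, Fraction("9.987")), subtract=True)
print(y[1], float(y[2]))   # -9 7.73
```

The first case shows a carry (10.03644 becomes 1.004 after one right shift and rounding); the second shows two leading zeros removed by left shifts (0.0773 becomes 7.730 with the exponent reduced by 2).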

Multiplication and division are somewhat easier than addition and subtraction, in that no alignment of mantissas is needed.

Multiply Rule
1. Unpack signs, exponents, and significands. Handle exceptional operands.
2. Compute the result sign, $s_r = s_1 \oplus s_2$, add exponents, $e_r = e_1 + e_2$, and multiply significands, $f_r = f_1 \times f_2$.
3. If necessary, normalize by one left shift and decrement the result exponent. Round and shift right if rounding overflow occurs.
4. If the exponent is too positive, handle overflow, and if it is too negative, handle underflow.
5. Pack the result, encoding or reporting exceptions.

Divide Rule
1. Unpack. Handle exceptions.
2. Compute the result sign, $s_r = s_1 \oplus s_2$, subtract the exponent of the divisor from that of the dividend, $e_r = e_1 - e_2$, and divide the significands, $f_r = f_1 \div f_2$.
3. If necessary, normalize by one right shift and increment the result exponent. Round and correct for rounding overflow.
4. Handle overflow and underflow on exponent range as in multiply.
5. Pack the result and treat exceptions.
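The Multiply and Divide rules can be sketched in a few lines. The shift directions stated in the rules (one left shift after multiplication, one right shift after division) hold when normalized significands lie in the range 1/2 ≤ f < 1, so that convention is assumed here. Rounding is omitted because exact fractions are used, the exponent limits are borrowed from single precision, and all names are hypothetical.

```python
from fractions import Fraction

E_MIN, E_MAX = -126, 127   # assumed exponent limits, borrowed from single precision

def check_range(sr, er, fr):
    """Step 4: report exponent overflow or underflow, otherwise pack the result."""
    if er > E_MAX:
        raise OverflowError("exponent overflow")
    if er < E_MIN:
        raise ArithmeticError("exponent underflow")
    return (sr, er, fr)

def fp_mul(a, b):
    """Multiply numbers given as (sign, exp, sig) with 1/2 <= sig < 1."""
    (s1, e1, f1), (s2, e2, f2) = a, b            # step 1: unpack
    sr, er, fr = s1 ^ s2, e1 + e2, f1 * f2       # step 2: sign, exponent, significand
    if fr < Fraction(1, 2):                      # step 3: product lies in [1/4, 1)
        fr, er = fr * 2, er - 1                  # one left shift, decrement exponent
    return check_range(sr, er, fr)

def fp_div(a, b):
    """Divide a by b, both given as (sign, exp, sig) with 1/2 <= sig < 1."""
    (s1, e1, f1), (s2, e2, f2) = a, b
    if f2 == 0:
        raise ZeroDivisionError("divide by zero")
    sr, er, fr = s1 ^ s2, e1 - e2, f1 / f2       # quotient lies in (1/2, 2)
    if fr >= 1:
        fr, er = fr / 2, er + 1                  # one right shift, increment exponent
    return check_range(sr, er, fr)

a = (0, 2, Fraction(3, 4))    # +0.75  x 2^2 =  3
b = (1, 3, Fraction(5, 8))    # -0.625 x 2^3 = -5
print(fp_mul(a, b))           # (1, 4, Fraction(15, 16)), i.e. -15
print(fp_div(a, b))           # (1, 0, Fraction(3, 5)),   i.e. -0.6
```

In both operations at most one shift is needed, which is why no leading-zero count appears, in contrast to addition and subtraction.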

Implementing Floating-Point Operations:


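If all numbers have the same scale factor, addition and subtraction are easy, since $f \times 2^{e} + g \times 2^{e} = (f + g) \times 2^{e}$, provided that (f + g) does not overflow. The scale changes in multiplication and division because, even if both operands are scaled the same, $(f \times 2^{e}) \times (g \times 2^{e}) = (f \times g) \times 2^{2e}$ and $(f \times 2^{e}) \div (g \times 2^{e}) = f \div g$.

Multiplication and division compute a new scale factor for the result from those of the operands: $(f_1 \times 2^{e_1}) \times (f_2 \times 2^{e_2}) = (f_1 \times f_2) \times 2^{e_1 + e_2}$ and $(f_1 \times 2^{e_1}) \div (f_2 \times 2^{e_2}) = (f_1 \div f_2) \times 2^{e_1 - e_2}$.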

The hardware implementation of floating-point operations involves a considerable amount of logic circuitry. These operations can also be implemented by software routines. In either case, the computer must be able to convert input and output from and to the user's decimal representation of numbers. In most general-purpose processors, floating-point operations are available at the machine-instruction level, implemented in hardware. An example of the implementation of floating-point operations is shown in Figure 22. This is a block diagram of a hardware implementation for the addition and subtraction of 32-bit floating-point operands that have the format shown in Figure 20. Following the Add/Subtract rule, we see that the first step is to compare exponents to determine how far to shift the mantissa of the number with the smaller exponent. The shift-count value, n, is determined by the 8-bit subtractor circuit in the upper left corner of the figure. The magnitude of the difference $|E'_A - E'_B|$, or n, is sent to the SHIFTER unit. If n is larger than the number of significant bits of the operands, then the answer is essentially the larger operand (except for guard and sticky-bit considerations in rounding), and shortcuts can be taken in deriving the result.


FIG 22: Floating point Arithmetic


The sign of the difference that results from comparing exponents determines which mantissa is to be shifted. Therefore, in step 1, the sign is sent to the SWAP network. If the sign is 0, then $E'_A \ge E'_B$ and the mantissas $M_A$ and $M_B$ are sent straight through the SWAP network. This results in $M_B$ being sent to the SHIFTER, to be shifted n positions to the right. The other mantissa, $M_A$, is sent directly to the mantissa adder/subtractor. If the sign is 1, then $E'_A < E'_B$ and the mantissas are swapped before they are sent to the SHIFTER. Step 2 is performed by the two-way multiplexer, MUX, near the bottom left corner of the figure. The exponent of the result, E', is tentatively determined as $E'_A$ if $E'_A \ge E'_B$, or $E'_B$ if $E'_A < E'_B$, based on the sign of the difference resulting from comparing exponents in step 1. Step 3 involves the major component, the mantissa adder/subtractor in the middle of the figure. The CONTROL logic determines whether the mantissas are to be added or subtracted. This is decided by the signs of the operands ($S_A$ and $S_B$) and the operation (Add or Subtract) that is to be performed on the operands. The CONTROL logic also determines the sign of the result, $S_R$. For example, if A is negative ($S_A = 1$), B is positive ($S_B = 0$), and the operation is A - B, then the mantissas are added and the sign of the result is negative ($S_R = 1$). On the other hand, if A and B are both positive and the operation is A - B, then the mantissas are subtracted. The sign of the result, $S_R$, now depends on the mantissa subtraction operation. For instance, if $E'_A > E'_B$, then $M_A$ - (shifted $M_B$) is positive and the result is positive. But if $E'_B > E'_A$, then $M_B$ - (shifted $M_A$) is positive and the result is negative. This example shows that the sign from the exponent comparison is also required as an input to the CONTROL network. When $E'_A = E'_B$ and the mantissas are subtracted, the sign of the mantissa adder/subtractor output determines the sign of the result. The reader should now be

able to construct the complete truth table for the CONTROL network. Step 4 of the Add/Subtract rule consists of normalizing the result of step 3, mantissa M. The number of leading zeros in M determines the number of bit shifts, X, to be applied to M. The normalized value is truncated to generate the 24-bit mantissa, $M_R$, of the result. The value X is also subtracted from the tentative result exponent E' to generate the true result exponent, $E'_R$. Note that only a single right shift might be needed to normalize the result. This would be the case if two mantissas of the form 1.xx... were added. The vector M would then have the form 1x.xx... This would correspond to an X value of -1 in the figure.

Let us consider the actual hardware that is needed to implement the blocks in Figure 22. The two 8-bit subtractors and the mantissa adder/subtractor can be implemented by combinational logic, as discussed earlier in this chapter. Because their outputs must be in sign-and-magnitude form, we must modify some of our earlier discussions. A combination of 1's-complement arithmetic and sign-and-magnitude representation is often used. Considerable flexibility is allowed in implementing the SHIFTER and the output normalization operation. If a design with a modest logic gate count is required, the operations can be implemented with shift registers. However, they can also be built as combinational logic units for high performance, but in that case, a significant number of logic gates is needed. In high-performance processors, a significant portion of the chip area is assigned to floating-point operations.

KEY CONCEPTS: FLOATING POINT REPRESENTATION
Floating-point numbers are generally represented with the significand having a sign-magnitude representation and the exponent having a biased representation. The exponent base is implicit.
Floating-point standards must specify the base, the representation, and the number of bits devoted to exponent and significand.
Normalization eliminates multiple representations for the same value, and simplifies comparisons and arithmetic computations.
Floating-point arithmetic operations are composed of multiple fixed-point operations on the exponents and significands.
Floating-point addition and subtraction are more complicated than multiplication and division because they require comparison of exponents and shifting of the significands to "line up the binary points" prior to the actual addition or subtraction operation.
Floating-point multiplication and division, on the other hand, require only a maximum 1-bit shift of the significand to normalize the numbers.


CHAPTER 6:
BASIC PROCESSING UNIT:
The heart of any computer is the central processing unit (CPU). The CPU executes all the machine instructions and coordinates the activities of all other units during the execution of an instruction. This unit is also called the Instruction Set Processor (ISP). By looking at its internal structure, we can understand how it performs the tasks of fetching, decoding, and executing the instructions of a program. The processor is generally called the central processing unit (CPU) or micro processing unit (MPU). A high-performance processor can be built by making various functional units operate in parallel. High-performance processors have a pipelined organization, where the execution of one instruction is started before the execution of the preceding instruction is completed. In another approach, known as superscalar operation, several instructions are fetched and executed at the same time. Pipelining and superscalar architectures are the main techniques used to obtain high performance from a processor. A typical computing task consists of a series of steps specified by a sequence of machine instructions that constitute a program. A program is a set of instructions performing a meaningful task. An instruction is a command to the processor and is executed by carrying out a sequence of sub-operations called micro-operations. Figure 1 indicates the various blocks of a typical processing unit. Its main units are the PC, IR, ID, MAR, MDR, a set of register arrays for temporary storage, and the Timing and Control unit.

Fundamental Concepts:
Execution of a program by the processor starts with fetching instructions one at a time, decoding each instruction, and performing the operations specified. Instructions are fetched from successive memory locations until a branch or a jump instruction is encountered. The processor keeps track of the address of the memory location containing the next instruction to be fetched using the program counter (PC) or Instruction Pointer (IP). After fetching an instruction, the contents of the PC are updated to point to the next instruction in the sequence. But when a branch instruction is executed, the PC is loaded with a different (jump/branch) address.


Fig-1
The instruction register, IR, is another key register in the processor; it holds the op-code before decoding. The IR contents are then transferred to an instruction decoder (ID) for decoding. The decoder then informs the control unit about the task to be executed. The control unit, along with the timing unit, generates all the control signals needed for the instruction execution. Suppose that each instruction comprises 2 bytes, and that it is stored in one memory word. To execute an instruction, the processor has to perform the following four steps:

1. Fetch the contents of the memory location pointed to by the PC. The contents of this location are interpreted as an instruction code to be executed. Hence, they are loaded into the IR/ID. Symbolically, this operation can be written as IR ← [[PC]]
2. Assuming that the memory is byte addressable, increment the contents of the PC by 2, that is, PC ← [PC] + 2
3. Decode the instruction to understand the operation and generate the control signals necessary to carry out the operation.
4. Carry out the actions specified by the instruction in the IR.

In cases where an instruction occupies more than one word, steps 1 and 2 must be repeated as many times as necessary to fetch the complete instruction. These two steps together are usually referred to as the fetch phase; step 3 constitutes the decoding phase; and step 4 constitutes the execution phase. To study these operations in detail, let us examine the internal organization of the processor. The main building blocks of a processor are interconnected in a variety of ways. A very simple organization is shown in Figure 2. A more complex structure that provides high performance will be presented at the end.
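The fetch, decode, and execute phases can be sketched as a short simulation loop. This is an illustrative sketch only: the instruction set (LOADI, ADD, JUMP, HALT), the register names, and the dictionary used as memory are all hypothetical and are not part of the notes. Each instruction is assumed to occupy one 2-byte word in a byte-addressable memory, as in the text.

```python
def run(memory, pc=0, max_steps=100):
    """Fetch, decode, and execute until HALT; return the registers and the PC trace."""
    regs = {"R0": 0, "R1": 0}
    trace = []
    for _ in range(max_steps):
        trace.append(pc)
        ir = memory[pc]                 # step 1: IR <- [[PC]]
        pc += 2                         # step 2: PC <- [PC] + 2
        opcode, *operands = ir          # step 3: decode
        if opcode == "LOADI":           # step 4: execute
            regs[operands[0]] = operands[1]
        elif opcode == "ADD":
            regs[operands[0]] += regs[operands[1]]
        elif opcode == "JUMP":
            pc = operands[0]            # a branch loads the PC with the target address
        elif opcode == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
    return regs, trace

program = {
    0: ("LOADI", "R0", 5),
    2: ("LOADI", "R1", 7),
    4: ("JUMP", 8),
    6: ("LOADI", "R0", 0),              # skipped by the jump
    8: ("ADD", "R0", "R1"),
    10: ("HALT",),
}
regs, trace = run(program)
print(regs)    # {'R0': 12, 'R1': 7}
print(trace)   # [0, 2, 4, 8, 10]
```

The trace shows the PC advancing by 2 after every fetch, except after the jump, where the incremented value 6 is overwritten by the branch address 8.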


Fig 2
Figure 2 shows an organization in which the arithmetic and logic unit (ALU) and all the registers are interconnected through a single common bus, which is internal to the processor. The data and address lines of the external memory bus are shown in Figure 2 connected to the internal processor bus via the memory data register, MDR, and the memory address register, MAR, respectively. Register MDR has two inputs and two outputs. Data may be loaded into MDR either from the memory bus or from the internal processor bus. The data stored in MDR may be placed on either bus. The input of MAR is connected to the internal bus, and its output is connected to the external bus. The control lines of the memory bus are connected to the instruction decoder and control logic block. This unit is responsible for issuing the signals that control the operation of all the units inside the processor and for interacting with the memory bus. The number and use of the processor registers R0 through R(n - 1) vary considerably from one processor to another. Registers may be provided for general-purpose use by the programmer. Some may be dedicated as special-purpose registers, such as index registers or stack pointers. Three registers, Y, Z, and TEMP in Figure 2, have not been mentioned before. These registers are transparent to the programmer, that is, the programmer need not be concerned with them because they are never referenced explicitly by any instruction. They are used by the processor for temporary storage during execution of some instructions. These registers are never used for storing data generated by one instruction for later use by another instruction. The multiplexer MUX selects either the output of register Y or a constant value 4 to be

provided as input A of the ALU. The constant 4 is used to increment the contents of the program counter. We will refer to the two possible values of the MUX control input Select as Select4 and SelectY for selecting the constant 4 or register Y, respectively. As instruction execution progresses, data are transferred from one register to another, often passing through the ALU to perform some arithmetic or logic operation. The instruction decoder and control logic unit is responsible for implementing the actions specified by the instruction loaded in the IR register. The decoder generates the control signals needed to select the registers involved and direct the transfer of data. The registers, the ALU, and the interconnecting bus are collectively referred to as the data path. With few exceptions, an instruction can be executed by performing one or more of the following operations in some specified sequence:

1. Transfer a word of data from one processor register to another or to the ALU
2. Perform an arithmetic or a logic operation and store the result in a processor register
3. Fetch the contents of a given memory location and load them into a processor register
4. Store a word of data from a processor register into a given memory location

We now consider in detail how each of these operations is implemented, using the simple processor model in Figure 2. Instruction execution involves a sequence of steps in which data are transferred from one register to another. For each register, two control signals are used to place the contents of that register on the bus or to load the data on the bus into the register. This is represented symbolically in Figure 3. The input and output of register Ri are connected to the bus via switches controlled by the signals Riin and Riout, respectively. When Riin is set to 1, the data on the bus are loaded into Ri. Similarly, when Riout is set to 1, the contents of register Ri are placed on the bus. While Riout is equal to 0, the bus can be used for transferring data from other registers. Suppose that we wish to transfer the contents of register R1 to register R4. This can be accomplished as follows:

1. Enable the output of register R1 by setting R1out to 1. This places the contents of R1 on the processor bus.
2. Enable the input of register R4 by setting R4in to 1. This loads data from the processor bus into register R4.

All operations and data transfers within the processor take place within time periods defined by the processor clock. The control signals that govern a particular transfer are asserted at the start of the clock cycle. In our example, R1out and R4in are set to 1. The registers consist of edge-triggered flip-flops. Hence, at the next active edge of the clock, the flip-flops that constitute R4 will load the data present at their inputs. At the same time, the control signals R1out and R4in will return to 0. We will use this simple model of the timing of data transfers for the rest of this chapter. However, we should point out that other schemes are possible. For example, data transfers may use both the rising and falling edges of the clock. Also, when edge-triggered flip-flops are not used, two or more clock signals may be needed to guarantee proper transfer of data. This is known as multiphase clocking.
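The register-to-register transfer over the single bus can be modelled in a few lines. This is a simplified sketch: the function name `clock_cycle` and the convention of passing the asserted control signals as a set of strings are assumptions made for illustration, and timing within the clock cycle is not modelled.

```python
def clock_cycle(regs: dict, signals: set) -> None:
    """Apply the control signals asserted during one clock cycle (updates regs in place)."""
    drivers = [s[:-3] for s in signals if s.endswith("out")]
    loaders = [s[:-2] for s in signals if s.endswith("in")]
    if len(drivers) > 1:
        raise ValueError("bus conflict: more than one register output enabled")
    if not drivers:
        return                       # nothing drives the bus, so nothing is loaded
    bus = regs[drivers[0]]           # Riout = 1 places the contents of Ri on the bus
    for name in loaders:             # Riin = 1 loads the bus into Ri at the clock edge
        regs[name] = bus

regs = {"R1": 42, "R2": 7, "R3": 0, "R4": 0}
clock_cycle(regs, {"R1out", "R4in"})
print(regs["R4"])  # 42
```

The bus-conflict check reflects the constraint that only one register output may be connected to the bus in any clock cycle.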

An implementation for one bit of register Ri is shown in Figure 7.3 as an example. A two-input multiplexer is used to select the data applied to the input of an edge-triggered D flip-flop. When the control input Riin is equal to 1, the multiplexer selects the data on the bus. This data will be loaded into the flip-flop at the rising edge of the clock. When Riin is equal to 0, the multiplexer feeds back the value currently stored in the flip-flop. The Q output of the flip-flop is connected to the bus via a tri-state gate. When Riout is equal to 0, the gate's output is in the high-impedance (electrically disconnected) state. This corresponds to the open-circuit state of a switch. When Riout = 1, the gate drives the bus to 0 or 1, depending on the value of Q.
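The behaviour of one register bit (input multiplexer, D flip-flop, tri-state output gate) can be sketched as a small class. The class name `RegisterBit` is hypothetical, and `None` is used here as a stand-in for the high-impedance state.

```python
class RegisterBit:
    """One bit of register Ri: input multiplexer, D flip-flop, tri-state output."""

    def __init__(self) -> None:
        self.q = 0

    def clock_edge(self, bus_bit: int, r_in: int) -> None:
        # The multiplexer selects the bus when Riin = 1, otherwise it feeds Q back.
        d = bus_bit if r_in else self.q
        self.q = d                       # D is captured at the rising clock edge

    def drive(self, r_out: int):
        # Tri-state gate: drives Q when Riout = 1, otherwise high impedance (None).
        return self.q if r_out else None

bit = RegisterBit()
bit.clock_edge(bus_bit=1, r_in=1)        # load 1 from the bus
bit.clock_edge(bus_bit=0, r_in=0)        # hold: the bus value is ignored
print(bit.q, bit.drive(1), bit.drive(0)) # 1 1 None
```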

Register Transfer Language (RTL):


RTL is a modular, high-level mathematical notation used to describe a digital system. Digital systems of high complexity (LSI and above) cannot be described using the conventional state-table method. Such systems are described using RTL, and the various notations used in RTL are indicated below:

Registers are indicated using letters and numerals, part of a register is indicated using parentheses, etc.

Implementing a RTL statement:


An RTL statement is of the form
control function: micro-operation 1, micro-operation 2, ...
A control function is a single-valued Boolean function that is either TRUE or FALSE. The set of micro-operations is executed only if the control function is true. Hence, to implement an RTL statement, the control function is evaluated first and, if it is true, it generates an initiation signal to execute the micro-operation(s). Consider the following RTL statement; here the micro-operation "transfer the contents of A to B" is executed only if the condition xT1 is true. The hardware needed to implement the RTL statement is as shown in the diagram below.
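As a rough software analogy, and assuming the statement has the form xT1: B ← A as the text describes, the control function can be treated as a Boolean guard on the transfer. The function name step and the dictionary of registers are hypothetical choices for this sketch.

```python
def step(regs, x, t1):
    """Apply the RTL statement 'xT1: B <- A' once.

    regs is a dict of register values; x and t1 are the two Boolean inputs
    whose AND forms the control function (an assumption of this sketch).
    """
    control = x and t1                   # control function xT1
    if control:
        regs = dict(regs, B=regs["A"])   # micro-operation: B <- A
    return regs


done = step({"A": 5, "B": 0}, x=True, t1=True)      # transfer happens
skipped = step({"A": 5, "B": 0}, x=True, t1=False)  # B is left unchanged
```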

Consider another RTL statement implementation; the hardware needed to implement the RTL statement is as shown in the diagram below.

Various Arithmetic operations:


The following table illustrates various arithmetic operations and their RTL representation. Here, subtraction can also be implemented using complement method.


Various Logical operations:


The following table illustrates various logical operations and their RTL representation. Here, Ex-or is also considered as one of the basic operations.
Symbolic representation     Operation
Fi ← Ai ∨ Bi                OR
Fi ← Ai ⊕ Bi                EX-OR
Fi ← Ai . Bi                AND
Fi ← Ai'                    NOT
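Because each logic operation acts independently on every bit position, it maps directly onto bitwise operators. The sketch below assumes an 8-bit word; WIDTH, MASK and logic_op are names chosen for illustration only.

```python
WIDTH = 8                    # assumed word length for this sketch
MASK = (1 << WIDTH) - 1      # keeps results within WIDTH bits


def logic_op(op, a, b=0):
    """Apply one of the four logic micro-operations bit by bit."""
    if op == "OR":
        return (a | b) & MASK
    if op == "EX-OR":
        return (a ^ b) & MASK
    if op == "AND":
        return (a & b) & MASK
    if op == "NOT":
        return ~a & MASK     # complement, truncated to WIDTH bits
    raise ValueError("unknown operation: " + op)
```

No carry or borrow passes between bit positions in any of these operations, which is why they need far less circuitry than addition.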

Performing an Arithmetic or Logic Operation:


FIG 3
The ALU is a combinational circuit that has no internal storage. It performs arithmetic and logic operations on the two operands applied to its A and B inputs. In Figures 2 and 3, one of the operands is the output of the multiplexer MUX and the other operand is obtained directly from the bus. The result produced by the ALU is stored temporarily in register Z. Therefore, a sequence of operations to add the contents of register R1 to those of register R2 and store the result in register R3 is
1. R1out, Yin
2. R2out, SelectY, Add, Zin
3. Zout, R3in
The signals whose names are given in any step are activated for the duration of the clock cycle corresponding to that step. All other signals are inactive. Hence, in step 1, the output of register R1 and the input of register Y are enabled, causing the contents of R1 to be transferred over the bus to Y. In step 2, the multiplexer's Select signal is set to SelectY, causing the multiplexer to gate the contents of register Y to input A of the ALU. At the same time, the contents of register R2 are gated onto the bus and, hence, to input B. The function performed by the ALU depends on the signals applied to its control lines. In this case, the Add line is set to 1, causing the output of the ALU to be the sum of the two numbers at inputs A and B. This sum is loaded into register Z because its input control signal is activated. In step 3, the contents of register Z are transferred to the destination register, R3. This last transfer cannot be carried out during step 2, because only one register output can be connected to the bus during any clock cycle.
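The three control steps can be traced with a short Python sketch. It assumes 32-bit registers and models each step as one bus transfer; the function name run_add and the dictionary of registers are illustrative only.

```python
WORD_MASK = 0xFFFFFFFF  # assumed 32-bit word length


def run_add(regs):
    """Trace R3 <- R1 + R2 as three control steps on a single bus."""
    regs = dict(regs)

    # Step 1: R1out, Yin  (R1 drives the bus, Y loads it)
    bus = regs["R1"]
    regs["Y"] = bus

    # Step 2: R2out, SelectY, Add, Zin
    bus = regs["R2"]              # input B of the ALU comes from the bus
    a = regs["Y"]                 # input A comes from Y through the MUX
    regs["Z"] = (a + bus) & WORD_MASK

    # Step 3: Zout, R3in  (needs its own step: the bus carries one value per cycle)
    bus = regs["Z"]
    regs["R3"] = bus
    return regs


result = run_add({"R1": 7, "R2": 5, "R3": 0, "Y": 0, "Z": 0})
```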


In this introductory discussion, we assume that there is a dedicated signal for each function to be performed. For example, we assume that there are separate control signals to specify individual ALU operations, such as Add, Subtract, XOR, and so on. In reality, some degree of encoding is likely to be used. For example, if the ALU can perform eight different operations, three control signals would suffice to specify the required operation.
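The encoding idea can be illustrated as follows. The list of eight operations is hypothetical (the notes name only Add, Subtract and XOR), as are the function names encode and decode.

```python
# Hypothetical set of eight ALU operations; only some are named in the notes.
ALU_OPS = ["Add", "Subtract", "AND", "OR", "XOR", "NOT", "PassA", "PassB"]

# Number of control lines needed to select one of the operations.
CODE_BITS = (len(ALU_OPS) - 1).bit_length()


def encode(op):
    """Return the binary code carried on the ALU control lines."""
    return format(ALU_OPS.index(op), "0{}b".format(CODE_BITS))


def decode(code):
    """Recover the operation name from its binary code."""
    return ALU_OPS[int(code, 2)]
```

With eight operations, CODE_BITS evaluates to 3, in agreement with the three control signals mentioned above; a decoder inside the ALU would expand the code back into individual operation lines.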

Fetching a Word from Memory:


To fetch a word of information from memory, the processor has to specify the address of the memory location where this information is stored and request a Read operation. This applies whether the information to be fetched represents an instruction in a program or an operand specified by an instruction. The processor transfers the required address to the memory address register (MAR), whose output is connected to the address lines of the memory bus. At the same time, the processor uses the control lines of the memory bus to indicate that a Read operation is needed. When the requested data are received from the memory, they are stored in the memory data register (MDR), from where they can be transferred to other registers in the processor.

The connections for register MDR are illustrated in Figure 4. It has four control signals: MDRin and MDRout control the connection to the internal bus, and MDRinE and MDRoutE control the connection to the external bus. The circuit in Figure 7.3 is easily modified to provide the additional connections. A three-input multiplexer can be used, with the memory bus data line connected to the third input. This input is selected when MDRinE = 1. A second tri-state gate, controlled by MDRoutE, can be used to connect the output of the flip-flop to the memory bus.

Fig 4

During memory Read and Write operations, the timing of internal processor operations must be coordinated with the response of the addressed device on the memory bus. The processor completes one internal data transfer in one clock cycle. The speed of operation of the addressed device, on the other hand, varies with the device. We saw in Chapter 5 that modern processors include a cache memory on the same chip as the processor. Typically, a cache will respond to a memory read request in one clock cycle. However, when a cache miss occurs, the request is forwarded to the main memory, which introduces a delay of several clock cycles. A read or write request may also be intended for a register in a memory-mapped I/O device. Such I/O registers are not cached, so their accesses always take a number of clock cycles. To accommodate the variability in response time, the processor waits until it receives an indication that the requested Read operation has been completed. We will assume that a control signal called Memory-Function-Completed (MFC) is used for this purpose. The addressed device sets this signal to 1 to indicate that the contents of the specified location have been read and are available on the data lines of the memory bus. As an example of a read operation, consider the instruction Move (R1), R2. The actions needed to execute this instruction are:
1. MAR ← [R1]
2. Start a Read operation on the memory bus
3. Wait for the MFC response from the memory
4. Load MDR from the memory bus
5. R2 ← [MDR]
These actions may be carried out as separate steps, but some can be combined into a single step. Each action can be completed in one clock cycle, except action 3, which requires one or more clock cycles, depending on the speed of the addressed device. For simplicity, let us assume that the output of MAR is enabled all the time. Thus, the contents of MAR are always available on the address lines of the memory bus. This is the case when the processor is the bus master. When a new address is loaded into MAR, it will appear on the memory bus at the beginning of the next clock cycle, as shown in Figure 5a. A Read control signal is activated at the same time MAR is loaded. This signal will cause the bus interface circuit to send a read command, MR, on the bus. With this arrangement, we have combined actions 1 and 2 above into a single control step. Actions 3 and 4 can also be combined by activating control signal MDRinE while waiting for a response from the memory. Thus, the data received from the memory are loaded into MDR at the end of the clock cycle in which the MFC signal is received. In the next clock cycle, MDRout is activated to transfer the data to register R2. This means that the memory read operation requires three steps, which can be described by the signals being activated as follows:
1. R1out, MARin, Read
2. MDRinE, WMFC
3. MDRout, R2in
where WMFC is the control signal that causes the processor's control circuitry to wait for the arrival of the MFC signal. Figure 5a shows that MDRinE is set to 1 for exactly the same period as the read command, MR. Hence, in subsequent discussion, we will not specify the value of MDRinE explicitly, with the understanding that it is always equal to MR.
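The effect of WMFC on the cycle count can be sketched as below. The model is deliberately simple and rests on assumptions: memory is a Python dictionary, and the parameter latency (at least 1) stands for the number of clock cycles the addressed device takes before it asserts MFC.

```python
def memory_read(regs, memory, latency):
    """Trace Move (R1),R2; returns the final registers and the cycle count."""
    regs = dict(regs)
    cycles = 0

    # Step 1: R1out, MARin, Read
    regs["MAR"] = regs["R1"]
    cycles += 1

    # Step 2: MDRinE, WMFC  (the step is repeated until MFC = 1)
    waited = 0
    while True:
        cycles += 1
        waited += 1
        mfc = waited >= latency       # device signals completion
        if mfc:
            regs["MDR"] = memory[regs["MAR"]]
            break

    # Step 3: MDRout, R2in
    regs["R2"] = regs["MDR"]
    cycles += 1
    return regs, cycles


start = {"R1": 100, "R2": 0, "MAR": 0, "MDR": 0}
fast_regs, fast_cycles = memory_read(start, {100: 42}, latency=1)
slow_regs, slow_cycles = memory_read(start, {100: 42}, latency=4)
```

With a one-cycle response (a cache hit, say) the read takes three clock cycles; each extra cycle of device latency adds one cycle spent in step 2.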

Fig 5a

Storing a Word in Memory:


Writing a word into a memory location follows a similar procedure. The desired address is loaded into MAR. Then, the data to be written are loaded into MDR, and a Write command is issued. Hence, executing the instruction Move R2,(R1) requires the following sequence:
1. R1out, MARin
2. R2out, MDRin, Write
3. MDRoutE, WMFC

Figure 5b shows the timing waveforms for a memory write operation. During a write operation, the data remain on the bus for the entire operation, and the device accepts them when it is ready.

Fig 5b
As in the case of the read operation, the Write control signal causes the memory bus interface hardware to issue a Write command on the memory bus. The processor remains in step 3 until the memory operation is completed and an MFC response is received. Figure 6 indicates the complete bus transfer and inter-register transfers. The source registers are selected using a multiplexer bank. The destination register is selected using the decoder.


Fig 6

Execution of a Complete Instruction:


Let us now put together the sequence of elementary operations required to execute one instruction. Consider the instruction Add (R3), R1, which adds the contents of a memory location pointed to by R3 to register R1. Executing this instruction requires the following actions:
1. Fetch the instruction.
2. Fetch the first operand (the contents of the memory location pointed to by R3).
3. Perform the addition.
4. Load the result into R1.

Fig 7

The listing shown in Figure 7 above indicates the sequence of control steps required to perform these operations for the single-bus architecture of Figure 2. Instruction execution proceeds as follows. In step 1, the instruction fetch operation is initiated by loading the contents of the PC into the MAR and sending a Read request to the memory. The Select signal is set to Select4, which causes the multiplexer MUX to select the constant 4. This value is added to the operand at input B, which is the contents of the PC, and the result is stored in register Z. The updated value is moved from register Z back into the PC during step 2, while waiting for the memory to respond. In step 3, the word fetched from the memory is loaded into the IR. Steps 1 through 3 constitute the instruction fetch phase, which is the same for all instructions. The instruction decoding circuit interprets the contents of the IR at the beginning of step 4. This enables the control circuitry to activate the control signals for steps 4 through 7, which constitute the execution phase. The contents of register R3 are transferred to the MAR in step 4, and a memory read operation is initiated. Then the contents of R1 are transferred to register Y in step 5, to prepare for the addition operation. When the Read operation is completed, the memory operand is available in register MDR, and the addition operation is performed in step 6. The contents of MDR are gated to the bus, and thus also to the B input of the ALU, and register Y is selected as the second input to the ALU by choosing SelectY. The sum is stored in register Z, then transferred to R1 in step 7. The End signal causes a new instruction fetch cycle to begin by returning to step 1. This discussion accounts for all control signals in Figure 7 except Yin in step 2. There is no need to copy the updated contents of PC into register Y when executing the Add instruction.
But, in Branch instructions the updated value of the PC is needed to compute the Branch target address. To speed up the execution of Branch instructions, this value is copied into register Y in step 2. Since step 2 is part of the fetch phase, the same action will be performed for all instructions. This does not cause any harm because register Y is not used for any other purpose at that time.
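The seven steps described above can be traced with a small interpreter. The SEQUENCE table below is reconstructed from the prose description rather than copied from the figure, and the model makes simplifying assumptions: memory is a dictionary, every memory access completes within the WMFC step, and signal names ending in "out" or "in" identify the register that drives or loads from the bus.

```python
# Control steps for Add (R3),R1, reconstructed from the description in the text.
SEQUENCE = [
    ["PCout", "MARin", "Read", "Select4", "Add", "Zin"],
    ["Zout", "PCin", "Yin", "WMFC"],
    ["MDRout", "IRin"],
    ["R3out", "MARin", "Read"],
    ["R1out", "Yin", "WMFC"],
    ["MDRout", "SelectY", "Add", "Zin"],
    ["Zout", "R1in", "End"],
]


def execute(regs, memory):
    """Run the control sequence once and return the resulting registers."""
    regs = dict(regs)
    for signals in SEQUENCE:
        # Only one register output may be connected to the bus in a step.
        outs = [s[:-3] for s in signals if s.endswith("out")]
        assert len(outs) <= 1
        bus = regs[outs[0]] if outs else None

        alu = None
        if "Add" in signals:
            a = 4 if "Select4" in signals else regs["Y"]  # MUX output
            alu = a + bus                                  # input B is the bus

        new = dict(regs)  # edge-triggered: loads see the pre-edge values
        for s in signals:
            if s.endswith("in"):
                dest = s[:-2]
                new[dest] = alu if dest == "Z" else bus
        if "WMFC" in signals:
            # Assumed: the memory answers during this step.
            new["MDR"] = memory[regs["MAR"]]
        regs = new
    return regs


initial = {"PC": 1000, "MAR": 0, "MDR": 0, "IR": 0, "Y": 0, "Z": 0,
           "R1": 10, "R3": 2000}
final = execute(initial, {1000: "Add (R3),R1", 2000: 32})
```

In this trace the PC advances from 1000 to 1004 during the fetch phase, and R1 ends up holding 10 + 32.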

Branch Instructions:
A branch instruction replaces the contents of the PC with the branch target address. This address is usually obtained by adding an offset X, which is given in the branch instruction, to the updated value of the PC. The listing in Figure 8 below gives a control sequence that implements an unconditional branch instruction. Processing starts, as usual, with the fetch phase. This phase ends when the instruction is loaded into the IR in step 3. The offset value is extracted from the IR by the instruction decoding circuit, which will also perform sign extension if required. Since the value of the updated PC is already available in register Y, the offset X is gated onto the bus in step 4, and an addition operation is performed. The result, which is the branch target address, is loaded into the PC in step 5. The offset X used in a branch instruction is usually the difference between the branch target address and the address immediately following the branch instruction.
Fig 8
For example, if the branch instruction is at location 2000 and the branch target address is 2050, the value of X must be 46. The reason for this can be readily appreciated from the control sequence in Figure 7. The PC is incremented during the fetch phase, before knowing the type of instruction being executed. Thus, when the branch address is computed in step 4, the PC value used is the updated value, which points to the instruction following the branch instruction in the memory. Consider now a conditional branch. In this case, we need to check the status of the condition codes before loading a new value into the PC. For example, for a Branch-on-negative (Branch<0) instruction, step 4 is replaced with
Offset-field-of-IRout, Add, Zin, If N = 0 then End
Thus, if N = 0 the processor returns to step 1 immediately after step 4. If N = 1, step 5 is performed to load a new value into the PC, thus performing the branch operation.
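The offset arithmetic and the Branch<0 decision can be checked with a few lines of Python. The sketch assumes 4-byte instructions, consistent with the PC being incremented by 4 during the fetch phase; the function names are illustrative.

```python
INSTR_SIZE = 4  # the PC is incremented by 4 during the fetch phase


def branch_offset(branch_addr, target_addr):
    """Offset X stored in the instruction, relative to the updated PC."""
    return target_addr - (branch_addr + INSTR_SIZE)


def pc_after_branch_on_negative(updated_pc, offset, n_flag):
    """Branch<0: the PC is replaced only when the N flag is 1."""
    if n_flag == 0:
        return updated_pc          # 'If N = 0 then End': fall through
    return updated_pc + offset     # step 5: load the target into the PC


x = branch_offset(2000, 2050)
taken = pc_after_branch_on_negative(2004, x, n_flag=1)
not_taken = pc_after_branch_on_negative(2004, x, n_flag=0)
```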

Multiple-Bus Organization:
The resulting control sequences shown are quite long because only one data item can be transferred over the bus in a clock cycle. To reduce the number of steps needed, most commercial processors provide multiple internal paths that enable several transfers to take place in parallel. Figure 9 depicts a three-bus structure used to connect the registers and the ALU of a processor. All general-purpose registers are combined into a single block called the register file. In VLSI technology, the most efficient way to implement a number of registers is in the form of an array of memory cells similar to those used in the implementation of random-access memories (RAMs) described in Chapter 5. The register file in Figure 9 is said to have three ports. There are two outputs, allowing the contents of two different registers to be accessed simultaneously and have their contents placed on buses A and B. The third port allows the data on bus C to be loaded into a third register during the same clock cycle. Buses A and B are used to transfer the source operands to the A and B inputs of the ALU, where an arithmetic or logic operation may be performed. The result is transferred to the destination over bus C. If needed, the ALU may simply pass one of its two input operands unmodified to bus C. We will call the ALU control signals for such an operation R=A or R=B. The three-bus arrangement obviates the need for registers Y and Z in Figure 2. A second feature in Figure 9 is the introduction of the Incrementer unit, which is used to increment the PC by 4. The source for the constant 4 at the ALU input multiplexer is still useful. It can be used to increment other addresses, such as the memory addresses in Load Multiple and Store Multiple instructions.

Fig 9
Consider the three-operand instruction

Add R4,R5,R6

Fig 10
The control sequence for executing this instruction is given in Figure 10. In step 1, the contents of the PC are passed through the ALU, using the R=B control
signal, and loaded into the MAR to start a memory read operation. At the same time the PC is incremented by 4. Note that the value loaded into MAR is the original contents of the PC. The incremented value is loaded into the PC at the end of the clock cycle and will not affect the contents of MAR. In step 2, the processor waits for MFC and loads the data received into MDR, then transfers them to IR in step 3. Finally, the execution phase of the instruction requires only one control step to complete, step 4. By providing more paths for data transfer a significant reduction in the number of clock cycles needed to execute an instruction is achieved.

Hardwired Control:
To execute instructions, the processor must have some means of generating the control signals needed in the proper sequence. Computer designers use a wide variety of techniques to solve this problem. The approaches used fall into one of two categories: hardwired control and micro programmed control. We discuss each of these techniques in detail, starting with hardwired control in this section. Consider the sequence of control signals given in Figure 7. Each step in this sequence is completed in one clock period. A counter may be used to keep track of the control steps, as shown in Figure 11. Each state, or count, of this counter corresponds to one control step. The required control signals are determined by the following information:
1. Contents of the control step counter
2. Contents of the instruction register
3. Contents of the condition code flags
4. External input signals, such as MFC and interrupt requests

Fig 11

To gain insight into the structure of the control unit, we start with a simplified view of the hardware involved. The decoder/encoder block in Figure 11 is a combinational circuit that generates the required control outputs, depending on the state of all its inputs. By separating the decoding and encoding functions, we obtain the more detailed block diagram in Figure 12. The step decoder provides a separate signal line for each step, or time slot, in the control sequence. Similarly, the output of the instruction decoder consists of a separate line for each machine instruction. For any instruction loaded in the IR, one of the output lines INS1 through INSm is set to 1, and all other lines are set to 0. (For design details of decoders, refer to Appendix A.) The input signals to the encoder block in Figure 12 are combined to generate the individual control signals Yin, PCout, Add, End, and so on. An example of how the encoder generates the Zin control signal for the processor organization in Figure 2 is given in Figure 13a. This circuit implements the logic function
Zin = T1 + T6 · ADD + T4 · BR + ...
This signal is asserted during time slot T1 for all instructions, during T6 for an Add instruction, during T4 for an unconditional branch instruction, and so on. The logic function for Zin is derived from the control sequences in Figures 7 and 8. As another example, Figure 13b gives a circuit that generates the End control signal from the logic function
End = T7 · ADD + T5 · BR + (T5 · N + T4 · N') · BRN + ...
where N' denotes the complement of N. The End signal starts a new instruction fetch cycle by resetting the control step counter to its starting value. Figure 12 contains another control signal called RUN. When set to 1, RUN causes the counter to be incremented by one at the end of every clock cycle. When RUN is equal to 0, the counter stops counting. This is needed whenever the WMFC signal is issued, to cause the processor to wait for the reply from the memory.

Fig 12
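The two encoder functions can be written out as Boolean expressions in Python. Only the terms given in the text are modelled (the trailing terms for other instructions are omitted), and the representation of the step decoder output as an integer t and of the instruction decoder output as a string ins is an assumption of this sketch.

```python
def z_in(t, ins):
    """Zin = T1 + T6.ADD + T4.BR (terms for other instructions omitted)."""
    return t == 1 or (t == 6 and ins == "ADD") or (t == 4 and ins == "BR")


def end(t, ins, n):
    """End = T7.ADD + T5.BR + (T5.N + T4.N').BRN (other terms omitted)."""
    return (
        (t == 7 and ins == "ADD")
        or (t == 5 and ins == "BR")
        or (ins == "BRN" and ((t == 5 and n == 1) or (t == 4 and n == 0)))
    )
```

For instance, end(4, "BRN", 0) is true: a Branch<0 instruction whose condition is not met finishes after step 4, which matches the "If N = 0 then End" action described for conditional branches.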

Fig 13a
The control hardware shown can be viewed as a state machine that changes from one state to another in every clock cycle, depending on the contents of the instruction register, the condition codes, and the external inputs. The outputs of the state machine are the control signals. The sequence of operations carried out by this machine is determined by the wiring of the logic elements, hence the name "hardwired." A controller that uses this approach can operate at high speed. However, it has little flexibility, and the complexity of the instruction set it can implement is limited.

Fig 13b

A Complete Processor:


The heart of any processor is the ALU, which performs all arithmetic and logical operations. The design of an ALU can be studied under two separate headings: arithmetic circuit design and logic circuit design. Figure 14a indicates the block diagram of an ALU. The circuit diagram of a two-bit ALU is shown in Figure 14b.


Fig 14a
A complete processor can be designed using the structure shown in Figure 14c. This structure has an instruction unit that fetches instructions from an instruction cache or from the main memory when the desired instructions are not already in the cache. It has separate processing units to deal with integer data and floating-point data. Each of these units can be organized as shown in Figure 9. A data cache is inserted between these units and the main memory.


Fig 14b

Using separate caches for instructions and data is common practice in many processors today. Other processors use a single cache that stores both instructions and data. The processor is connected to the system bus and, hence, to the rest of the computer, by means of a bus interface. Although we have shown just one integer and one floating-point unit in Figure 14c, a processor may include several units of each type to increase the potential for concurrent operations.

Fig 14c

MICROPROGRAMMED CONTROL:
The ALU is the heart of any computing system, while the control unit is its brain. The design of a control unit is not unique; it varies from designer to designer. Some of the commonly used control logic design methods are:
Sequence register and decoder method
Hard-wired control method
PLA control method
Micro-program control method

The control signals required inside the processor can be generated using a control step counter and a decoder/ encoder circuit. Now we discuss an alternative scheme, called micro programmed control, in which control signals are generated by a program similar to machine language programs.

Fig 15
First, we introduce some common terms. A control word (CW) is a word whose individual bits represent the various control signals in Figure 12. Each of the control steps in the control sequence of an instruction defines a unique combination of 1s and 0s in the CW. The CWs corresponding to the 7 steps of Figure 7 are shown in Figure 15. We have assumed that SelectY is represented by Select = 0 and Select4 by Select = 1. A sequence of CWs corresponding to the control sequence of a machine instruction constitutes the micro routine for that instruction, and the individual control words in this micro routine are referred to as microinstructions. The micro routines for all instructions in the instruction set of a computer are stored in a special memory called the control store. The control unit can generate the control signals for any instruction by sequentially reading the CWs of the corresponding micro routine from the control store. This suggests organizing the control unit as shown in Figure 16. To read the control words sequentially from the control store, a micro program counter (µPC) is used. Every time a new instruction is loaded into the IR, the output of the block labeled "starting address generator" is loaded into the µPC. The µPC is then automatically incremented by the clock, causing successive microinstructions to be read from the control store. Hence, the control signals are delivered to various parts of the processor in the correct sequence. One important function of the control unit cannot be implemented by the simple organization in Figure 16. This is the situation that arises when the control unit is required to check the status of the condition codes or external inputs to choose between alternative courses of action. In the case of hardwired control, this situation is handled by including an appropriate logic function in the encoder circuitry. In micro programmed control, an alternative approach is to use conditional branch microinstructions. In addition to the branch address, these microinstructions specify which of the external inputs, condition codes, or, possibly, bits of the instruction register should be checked as a condition for branching to take place. The instruction Branch<0 may now be implemented by a micro routine such as that shown in Figure 17. After loading this instruction into IR, a branch microinstruction transfers control to the corresponding micro routine, which is assumed to start at location 25 in the control store. This address is the output of the starting address generator block. The microinstruction at location 25 tests the N bit of the condition codes. If this bit is equal to 0, a branch takes place to location 0 to fetch a new machine instruction. Otherwise, the microinstruction at location 26 is executed to compute the branch target address, and the microinstruction at location 27 loads this address into the PC.

Fig 16

Fig 17

Fig 18
To support micro program branching, the organization of the control unit should be modified as shown in Figure 18. The starting address generator block of Figure 16 becomes the starting and branch address generator. This block loads a new address into the µPC when a microinstruction instructs it to do so. To allow implementation of a conditional branch, inputs to this block consist of the external inputs and condition codes as well as the contents of the instruction register. In this control unit, the µPC is incremented every time a new microinstruction is fetched from the micro program memory, except in the following situations:
1. When a new instruction is loaded into the IR, the µPC is loaded with the starting address of the micro routine for that instruction.
2. When a Branch microinstruction is encountered and the branch condition is satisfied, the µPC is loaded with the branch address.
3. When an End microinstruction is encountered, the µPC is loaded with the address of the first CW in the micro routine for the instruction fetch cycle.
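These sequencing rules can be illustrated by tracing the µPC through a toy control store for the Branch<0 instruction. The layout is an assumption for illustration: the fetch micro routine is placed at addresses 0 to 2, a dispatch microinstruction at address 3, and the Branch<0 micro routine at 25 to 27 as stated in the text. The tuple encoding of microinstructions is hypothetical.

```python
# Toy control store: address -> (kind, argument). Layout is assumed.
CONTROL_STORE = {
    0: ("signals", "PCout, MARin, Read, Select4, Add, Zin"),
    1: ("signals", "Zout, PCin, Yin, WMFC"),
    2: ("signals", "MDRout, IRin"),
    3: ("dispatch", None),                 # load starting address into the uPC
    25: ("branch_if_n0", 0),               # if N = 0, branch to address 0
    26: ("signals", "Offset-field-of-IRout, SelectY, Add, Zin"),
    27: ("signals_end", "Zout, PCin"),     # last CW of the routine (End)
}
START_ADDRESS = {"Branch<0": 25}           # starting address generator
FETCH_START = 0


def trace(instruction, n_flag):
    """Return the uPC values visited for one instruction and the next uPC."""
    upc = FETCH_START
    visited = []
    while True:
        visited.append(upc)
        kind, arg = CONTROL_STORE[upc]
        if kind == "dispatch":
            upc = START_ADDRESS[instruction]   # rule 1
        elif kind == "branch_if_n0" and n_flag == 0:
            return visited, arg                # rule 2: branch taken
        elif kind == "signals_end":
            return visited, FETCH_START        # rule 3: End
        else:
            upc += 1                           # default: increment


taken_path, next_upc_taken = trace("Branch<0", n_flag=1)
skipped_path, next_upc_skipped = trace("Branch<0", n_flag=0)
```

When N = 1 the µPC passes through 25, 26 and 27; when N = 0 the micro routine is left at address 25 and the next instruction fetch starts at address 0.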

Microinstructions
Having described a scheme for sequencing microinstructions, we now take a closer look at the format of individual microinstructions. A straightforward way to structure microinstructions is to assign one bit position to each control signal, as in Figure 15. However, this scheme has one serious drawback: assigning individual bits to each control signal results in long microinstructions because the number of required signals is usually large. Moreover, only a few bits are set to 1 (to be used for active gating) in any given microinstruction, which means the available bit space is poorly used. Consider again the simple processor of Figure 2, and assume that it contains only four general-purpose registers, R0, R1, R2, and R3. Some of the connections in this processor are permanently enabled, such as the output of the IR to the decoding circuits and both inputs to the ALU. The remaining connections to various registers require a total of 20 gating signals. Additional control signals not shown in the figure are also needed, including the Read, Write, Select, WMFC, and End signals. Finally, we must specify the function to be performed by the ALU. Let us assume that 16 functions are provided, including Add, Subtract, AND, and XOR. These functions depend on the particular ALU used and do not necessarily have a one-to-one correspondence with the machine instruction OP codes. In total, 42 control signals are needed. If we use the simple encoding scheme described earlier, 42 bits would be needed in each microinstruction. Fortunately, the length of the microinstructions can be reduced easily. Most signals are not needed simultaneously, and many signals are mutually exclusive. For example, only one function of the ALU can be activated at a time. The source for a data transfer must be unique because it is not possible to gate the contents of two different registers onto the bus at the same time. Read and Write signals to the memory cannot be active simultaneously.
This suggests that signals can be grouped so that all mutually exclusive signals are placed in the same group. Thus, at most one micro operation per group is specified in any microinstruction. Then it is possible to use a binary coding scheme to represent the signals within a group. For example, four bits suffice to represent the 16 available functions in the ALU. Register output control signals can be placed in a group consisting of PCout, MDRout, Zout, Offsetout, R0out, R1out, R2out, R3out, and TEMPout. Any one of these can be selected by a unique 4-bit code. Further natural groupings can be made for the remaining signals. Figure 19 shows an example of a partial format for the microinstructions, in which each group occupies a field large enough to contain the required codes. Most fields must include one inactive code for the case in which no action is required. For example, the all-zero pattern in F1 indicates that none of the registers that may be specified in this field should have its contents placed on the bus. An inactive code is not needed in all fields.
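The width of an encoded field follows directly from the number of codes it must hold. The helper below is a sketch; the function name field_width and its parameter names are chosen for illustration.

```python
def field_width(n_signals, needs_inactive_code=True):
    """Bits needed to encode one group of mutually exclusive signals.

    One extra code is reserved for 'no action' when the field needs it.
    """
    codes = n_signals + (1 if needs_inactive_code else 0)
    return (codes - 1).bit_length()


# Register-output group: 9 signals plus the inactive code.
register_out_bits = field_width(9)

# ALU function field: 16 functions, no inactive code.
alu_bits = field_width(16, needs_inactive_code=False)
```

Both fields come out at 4 bits, matching the 4-bit codes mentioned in the text; reserving an inactive code in the 16-function ALU field would push it to 5 bits.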

For example, F4 contains 4 bits that specify one of the 16 operations performed in the ALU. Since no spare code is included, the ALU is active during the execution of every microinstruction. However, its activity is monitored by the rest of the machine through register Z, which is loaded only when the Zin signal is activated. Grouping control signals into fields requires a little more hardware because decoding circuits must be used to decode the bit patterns of each field into individual control signals. The cost of this additional hardware is more than offset by the reduced number of bits in each microinstruction, which results in a smaller control store. In Figure 19, only 20 bits are needed to store the patterns for the 42 signals. So far we have considered grouping and encoding only mutually exclusive control signals. We can extend this idea by enumerating the patterns of required signals in all possible microinstructions. Each meaningful combination of active control signals can

Fig 19
then be assigned a distinct code that represents the microinstruction. Such full encoding is likely to further reduce the length of micro words but also to increase the complexity of the required decoder circuits. Highly encoded schemes that use compact codes to specify only a small number of control functions in each microinstruction are referred to as a vertical organization. On the other hand, the minimally encoded scheme of Figure 15, in which many resources can be controlled with a single microinstruction, is called a horizontal organization. The horizontal approach is useful when a higher operating speed is desired and when the machine structure allows parallel use of resources. The vertical approach results in considerably slower operating speeds because more microinstructions are needed to perform the desired control functions. Although fewer bits are required for each microinstruction, this does not imply that the total number of
bits in the control store is smaller. The significant factor is that less hardware is needed to handle the execution of microinstructions. Horizontal and vertical organizations represent the two organizational extremes in micro programmed control. Many intermediate schemes are also possible, in which the degree of encoding is a design parameter. The layout in Figure 19 is a horizontal organization because it groups only mutually exclusive micro operations in the same fields. As a result, it does not limit in any way the processor's ability to perform various micro operations in parallel. Although we have considered only a subset of all the possible control signals, this subset is representative of actual requirements. We have omitted some details that are not essential for understanding the principles of operation.

Micro program Sequencing


The simple micro program example in Figure 15 requires only straightforward sequential execution of microinstructions, except for the branch at the end of the fetch phase. If each machine instruction is implemented by a micro routine of this kind, the micro control structure suggested in Figure 18, in which a µPC governs the sequencing, would be sufficient. A micro routine is entered by decoding the machine instruction into a starting address that is loaded into the µPC. Some branching capability within the micro program can be introduced through special branch microinstructions that specify the branch address, similar to the way branching is done in machine-level instructions. With this approach, writing micro programs is fairly simple because standard software techniques can be used. However, this advantage is countered by two major disadvantages. Having a separate micro routine for each machine instruction results in a large total number of microinstructions and a large control store. If most machine instructions involve several addressing modes, there can be many instruction and addressing mode combinations. A separate micro routine for each of these combinations would produce considerable duplication of common parts. We want to organize the micro program so that the micro routines share as many common parts as possible. This requires many branch microinstructions to transfer control among the various parts. Hence, a second disadvantage arises: execution time is longer because it takes more time to carry out the required branches. Consider a more complicated example of a complete machine instruction

Add src, Rdst

which adds the source operand to the contents of register Rdst and places the sum in Rdst, the destination register. Let us assume that the source operand can be specified in the following addressing modes: register, auto increment, auto decrement, and indexed, as well as the indirect forms of these four modes. We now use this instruction in conjunction with the processor structure in Figure 2 to demonstrate a possible micro programmed implementation. A suitable micro program is presented in flowchart form, for easier understanding, in Figure 20. Each box in the chart corresponds to a microinstruction that controls the transfers and operations indicated within the box. The microinstruction is located at the address indicated by the octal number above the upper right-hand corner of the box. Each octal digit represents three bits. We use the octal notation in this example as a convenient shorthand notation for binary numbers. Most of the flowchart in the figure is self-explanatory, but some details warrant elaboration. We will explain the issues involved first, and then examine the flow of microinstructions in the figure in some detail.

Fig 20

Branch Address Modification Using Bit-ORing:


The micro program in Figure 20 shows that branches are not always made to a single branch address. This is a direct consequence of combining simple micro routines by sharing common parts. Consider the point labeled in the figure. At this point, it is necessary to choose between actions required by direct and indirect addressing modes. If the indirect mode is specified in the instruction, then the microinstruction in location 170 is performed to fetch the operand from the memory. If the direct mode is specified, this fetch must be bypassed by branching immediately to location 171. The most efficient way to bypass microinstruction 170 is to have the preceding branch microinstructions specify the address 170 and then use an OR gate to change the least-significant bit of this address to 1 if the direct addressing mode is involved. This is known as the bit-ORing technique for modifying branch addresses.

An alternative to the bit-ORing approach is to use two conditional branch microinstructions at locations 123, 143, and 166. Another possibility is to include two next-address fields within a branch microinstruction, one for the direct and one for the indirect address modes. Both of these alternatives are inferior to the bit-ORing technique.
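The direct/indirect address modification described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `branch_target` and its parameters are hypothetical, and the sketch assumes a single indirect bit whose value 1 selects the indirect mode.

```python
def branch_target(base_address: int, indirect_bit: int) -> int:
    """Return the branch address after bit-ORing.

    base_address is the address named in the branch microinstruction
    (octal 170 in the text). indirect_bit is assumed to be 1 for the
    indirect mode and 0 for the direct mode.
    """
    # OR the inverse of the indirect bit into the least-significant bit.
    return base_address | ((~indirect_bit) & 1)


print(oct(branch_target(0o170, indirect_bit=1)))  # 0o170: perform the fetch
print(oct(branch_target(0o170, indirect_bit=0)))  # 0o171: bypass the fetch
```

Because an OR gate can only set a bit, the unmodified address must be the one whose least-significant bit is 0, which is why the branch microinstructions specify 170 rather than 171.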

Wide-Branch Addressing
Figure 20 includes a wide branch in the microinstruction at location 003. The instruction decoder, abbreviated InstDec in the figure, generates the starting address of the micro routine that implements the instruction that has just been loaded into the IR. In our example, register IR contains the Add instruction, for which the instruction decoder generates the microinstruction address 101. However, this address cannot be loaded as is into the micro program counter. The source operand of the Add instruction can be specified in any of several addressing modes. The figure shows five possible branches that the Add instruction may follow. From left to right these are the indexed, auto decrement, auto increment, register direct, and register indirect addressing modes. The bit-ORing technique described above can be used at this point to modify the starting address generated by the instruction decoder to reach the appropriate path. For the addresses shown in the figure, bit-ORing should change the address 101 to one of the five possible address values, 161, 141, 121, 101, or 111, depending on the addressing mode used in the instruction.
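A short Python check can confirm that each of the five target addresses is reachable from 101 by ORing alone. This is a sketch, not part of the original design: the names `BASE`, `TARGETS`, and `or_mask` are hypothetical, while the octal addresses are the ones listed in the text.

```python
BASE = 0o101  # address produced by the instruction decoder for Add

# Branch targets listed in the text, keyed by addressing mode.
TARGETS = {
    "indexed": 0o161,
    "auto decrement": 0o141,
    "auto increment": 0o121,
    "register direct": 0o101,
    "register indirect": 0o111,
}


def or_mask(base: int, target: int) -> int:
    """Return the bits that must be ORed into base to obtain target."""
    if base | target != target:
        # ORing can only set bits, so every 1 in base must also be in target.
        raise ValueError("target cannot be reached by ORing bits into base")
    return target & ~base


for mode, target in TARGETS.items():
    mask = or_mask(BASE, target)
    print(f"{mode}: {BASE:03o} OR {mask:03o} = {BASE | mask:03o}")
```

The masks come out as 060, 040, 020, 000, and 010 (octal), so only the middle octal digit of the address is ever modified.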

USE OF WMFC:
We have assumed that it is possible to issue a wait for MFC command in a branch microinstruction. This is done in the microinstruction at location 112, for example, which causes a branch to the microinstruction in location 171. Combining these two operations introduces a subtle problem. The WMFC signal means that the microinstruction may take several clock cycles to complete. If the branch is allowed to happen in the first clock cycle, the microinstruction at location 171 would be fetched and executed prematurely. To avoid this problem, the branch must not take place until the memory transfer in progress is completed, that is, the WMFC signal must inhibit any change in the contents of the micro program counter during the waiting period. Let us examine one path of the flowchart in Figure 20 in more detail. Consider the case in which the source operand is accessed in the auto increment mode. This is the path needed to execute the instruction

Add (Rsrc)+, Rdst


where Rsrc and Rdst are general-purpose registers in the machine. Figure 21 shows the complete micro routine for fetching and executing this instruction. We assume that the instruction has a 3-bit field used to specify the addressing mode for the source operand, as shown. Bit patterns 11, 10, 01, and 00, located in bits 10 and 9, denote the indexed, auto decrement, auto increment, and register modes, respectively. For each of these modes, bit 8 is used to specify the indirect version. For example, 010 in the mode field specifies the direct version of the auto increment mode, whereas 011 specifies the indirect version. We also assume that the processor has 16 registers that can be used for addressing purposes, each specified using a 4-bit code. Thus, the source operand is fully specified using the mode field and the register indicated by bits 7 through 4. The destination operand is in the register specified by bits 3 through 0.

Since any of the 16 general-purpose registers may be involved in determining the source and destination operand locations, the microinstructions refer to the respective control signals only as Rsrcout, Rsrcin, Rdstout, and Rdstin. These signals must be translated into specific register transfer signals by the decoding circuitry connected to the Rsrc and Rdst address fields of the IR. This means that there are two stages of decoding. First, the microinstruction field must be decoded to determine that an Rsrc or Rdst register is involved. The decoded output is then used to gate the contents of the Rsrc or Rdst fields in the IR into a second decoder, which produces the gating signals for the actual registers R0 to R15.

The micro program in Figure 20 has been derived by combining the micro routines for all possible values in the mode field, resulting in a structure that requires many branch points. The example in Figure 21 has two branch points, so two branch microinstructions are required. In each case, the expression in brackets indicates the branch address that is to be loaded into the PC and how this address is modified using the bit-ORing scheme. Consider the microinstruction at location 123 as an example. Its unmodified version causes a branch to the microinstruction at location 170, which causes another fetch from the main memory corresponding to an indirect addressing mode.
For a direct addressing mode, this fetch is bypassed by ORing the inverse of the indirect bit in the src address field (bit 8 in the IR) with the 0 bit position of the PC. Another example of the use of bit-ORing is the microinstruction in location 003. There are five starting addresses for the micro routine that implements the Add instruction in question, depending on the address mode specified for the source operand. These addresses differ in the middle octal digit only. Hence, the required branch is implemented by using bit-ORing to modify the middle octal digit of the pattern 101 obtained from the instruction decoder. The 3 bits to be ORed with the digit are supplied by the decoding circuitry connected to the src address mode field (bits 8, 9, and 10 of the IR). The microinstruction addresses have been chosen to make this modification easy to implement: bits 4 and 5 of the PC are set directly from bits 9 and 10 of the IR. This suffices to select the appropriate microinstruction for all src address modes except one. The register indirect mode is covered by setting bit 3 of the PC to 1 when NOT[IR10] · NOT[IR9] · [IR8] is equal to 1, that is, when the mode field is 001. Register indirect is a special case, because it is the only indirect mode that does not use the microinstruction at 170.
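The instruction fields and the two-stage register decoding described above can be sketched as follows. This is a minimal illustration, not the actual decoding circuitry: the names `AddFields`, `decode_fields`, and `register_signal` are hypothetical, and any OP-code bits above bit 10 of the IR are simply ignored here.

```python
from typing import NamedTuple


class AddFields(NamedTuple):
    mode: int  # IR bits 10..8: addressing mode, bit 8 is the indirect bit
    rsrc: int  # IR bits 7..4: source register number
    rdst: int  # IR bits 3..0: destination register number


def decode_fields(ir: int) -> AddFields:
    """Extract the mode, Rsrc, and Rdst fields from the IR contents."""
    return AddFields(mode=(ir >> 8) & 0b111,
                     rsrc=(ir >> 4) & 0b1111,
                     rdst=ir & 0b1111)


def register_signal(micro_signal: str, ir: int) -> str:
    """Translate a generic signal such as 'Rsrcout' into e.g. 'R5out'."""
    fields = decode_fields(ir)
    # Stage 1: decide whether the Rsrc or the Rdst field is involved.
    if micro_signal.startswith("Rsrc"):
        number, suffix = fields.rsrc, micro_signal[len("Rsrc"):]
    elif micro_signal.startswith("Rdst"):
        number, suffix = fields.rdst, micro_signal[len("Rdst"):]
    else:
        raise ValueError(f"not a generic register signal: {micro_signal}")
    if suffix not in ("in", "out"):
        raise ValueError(f"unknown gating direction: {suffix!r}")
    # Stage 2: the selected IR field picks one of the registers R0 to R15.
    return f"R{number}{suffix}"


# Add (R5)+, R2: mode 010 (auto increment, direct), Rsrc = 5, Rdst = 2.
ir = 0b010_0101_0010
print(decode_fields(ir))               # AddFields(mode=2, rsrc=5, rdst=2)
print(register_signal("Rsrcout", ir))  # R5out
print(register_signal("Rdstin", ir))   # R2in
```

Keeping the microinstructions in terms of Rsrc and Rdst means one micro routine serves all 16 registers; only the contents of the IR change.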

Fig 21

Microinstructions with Next-Address Field:


The microprogram in Figure 20 requires several branch microinstructions. These microinstructions perform no useful operation in the datapath; they are needed only to determine the address of the next microinstruction. Thus, they detract from the operating speed of the computer. The situation can become significantly worse when other microroutines are considered. The increase in branch microinstructions stems partly from limitations in the ability to assign successive addresses to all microinstructions that are generally executed in consecutive order. This problem prompts us to reevaluate the sequencing technique built around an incrementable PC. A powerful alternative is to include an address field as a part of every microinstruction to indicate the location of the next microinstruction to be fetched. This means, in effect, that every microinstruction becomes a branch microinstruction, in addition to its other functions. The flexibility of this approach comes at the expense of additional bits for the address field. The severity of this penalty can be assessed as follows: In a typical computer, it is possible to design a complete microprogram with fewer than 4K microinstructions, employing perhaps 50 to 80 bits per microinstruction. Since 4K is 2 to the power 12, this implies that an address field of 12 bits is required. Therefore, approximately one-sixth of the control store capacity would be devoted to addressing. Even if more extensive microprograms are needed, the address field would be only slightly larger. The most obvious advantage of this approach is that separate branch microinstructions are virtually eliminated. Furthermore, there are few limitations in assigning addresses to microinstructions. These advantages more than offset any negative attributes and make the scheme very attractive. Since each microinstruction contains the address of the next microinstruction, there is no need for a counter to keep track of sequential addresses.
Hence, the PC is replaced with a microinstruction address register (AR), which is loaded from the next-address field in each microinstruction. A new control structure that incorporates this feature and supports bit-ORing is shown in Figure 22. The next-address bits are fed through the OR gates to the AR, so that the address can be modified on the basis of the data in the IR, external inputs, and condition codes. The decoding circuits generate the starting address of a given microroutine on the basis of the OP code in the IR. Let us now reconsider the example of Figure 21 using the microprogrammed control structure of Figure 22. We need several control signals that are not included in the microinstruction format in Figure 19. Instead of referring to registers R0 to R15 explicitly, we use the names Rsrc and Rdst, which can be decoded into the actual control signals with the data in the src and dst fields of the IR. Branching with the bit-ORing technique requires that we include the appropriate commands in the microinstructions. In the flowchart of Figure 20, bit-ORing is needed in microinstruction 003 to determine the address of the next microinstruction based on the addressing mode of the source operand. The addressing mode is indicated by bits 8 through 10 of the instruction register, as shown in Figure 21. Let the signal ORmode control whether or not this bit-ORing is used. In microinstructions 123, 143, and 166, bit-ORing is used to decide if indirect addressing of the source operand is to be used. We use the signal ORindsrc for this purpose.

Fig 22
For simplicity, we use separate bits in the microinstructions to denote these signals. One bit in the microinstruction is used to indicate when the output of the instruction decoder is to be gated into the AR. Finally, each microinstruction contains an 8-bit field that holds the address of the next microinstruction. Figure 23 shows a complete format for these microinstructions. This format is an expansion of the format in Figure 19. Using such microinstructions, we can implement the micro routine of Figure 21 as shown in Figure 24. The revised routine has one fewer microinstruction. The branch microinstruction at location 123 has been combined with the microinstruction immediately preceding it. When microinstruction sequencing is controlled by a PC, the End signal is used to reset the PC to point to the starting address of the microinstruction that fetches the next machine instruction to be executed. In our example, this starting address is 000 (octal). However, the micro routine in Figure 24 does not terminate by producing the End signal. In an organization such as this, the starting address is not specified by a resetting mechanism triggered by the End signal; instead, it is specified explicitly in the F0 field.
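The way the AR is loaded in this organization can be modeled roughly as below. This is a behavioral sketch under stated assumptions, not the circuit of Figure 22: the function `next_ar` and its parameter names are hypothetical, the register indirect condition is taken to be mode field 001, and only the two OR signals named in the text (ORmode and ORindsrc) are modeled.

```python
def next_ar(next_field: int, decoder_address: int, ir: int,
            gate_decoder: bool, or_mode: bool, or_indsrc: bool) -> int:
    """Return the address loaded into the AR for the next microinstruction."""
    ir10 = (ir >> 10) & 1
    ir9 = (ir >> 9) & 1
    ir8 = (ir >> 8) & 1  # indirect bit: 1 means indirect

    # Address source: instruction decoder output or the next-address field.
    address = decoder_address if gate_decoder else next_field

    if or_mode:
        # IR bits 10 and 9 drive address bits 5 and 4.
        address |= (ir10 << 5) | (ir9 << 4)
        # Register indirect (mode 001) sets address bit 3.
        address |= ((1 - ir10) & (1 - ir9) & ir8) << 3

    if or_indsrc:
        # Direct mode (IR bit 8 = 0) sets bit 0 and skips the extra fetch.
        address |= 1 - ir8

    return address & 0xFF  # the next-address field is 8 bits wide


ir = 0b010_0101_0010  # Add (R5)+, R2: auto increment, direct
print(oct(next_ar(0, 0o101, ir, True, True, False)))       # 0o121
print(oct(next_ar(0o170, 0o101, ir, False, False, True)))  # 0o171
```

With every microinstruction carrying its own next address, the first call models microinstruction 003 selecting the auto increment path, and the second models the combined branch that bypasses location 170 for a direct operand.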

Fig 23

Fig 23
It shows how control signals can be decoded from the microinstruction fields and used to control sequencing. Detailed circuitry for bit-ORing is shown in Figure 26.


Prefetching Microinstructions:
One drawback of micro-programmed control is that it leads to a slower operating speed because of the time it takes to fetch microinstructions from the control store. Faster operation is achieved if the next microinstruction is pre-fetched while the current one is being executed. In this way, the execution time can be overlapped with the fetch time. Pre-fetching microinstructions presents some organizational difficulties. Sometimes the status flags and the results of the currently executed microinstruction are needed to determine the address of the next microinstruction. Thus, straightforward pre-fetching occasionally pre-fetches a wrong microinstruction. In these cases, the fetch must be repeated with the correct address, which requires more complex hardware. However, the disadvantages are minor, and the pre-fetching technique is often used.
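The benefit of overlapping fetch and execution can be estimated with a very simple timing model. The sketch below is an assumption-laden illustration rather than a description of any real control unit: the function `micro_routine_time` and its parameters are hypothetical, every microinstruction is assumed to have the same fetch and execute times, and each wrong pre-fetch is assumed to cost exactly one additional, non-overlapped fetch.

```python
def micro_routine_time(count: int, fetch: int, execute: int,
                       prefetch: bool = False,
                       wrong_prefetches: int = 0) -> int:
    """Estimate the time units needed to run `count` microinstructions."""
    if count <= 0:
        return 0
    if not prefetch:
        # Each microinstruction is fetched, then executed, strictly in turn.
        return count * (fetch + execute)
    # With pre-fetching, the fetch of the next microinstruction overlaps
    # the execution of the current one, so each step costs the longer of
    # the two. Only the first fetch and the last execution are exposed.
    overlapped = fetch + (count - 1) * max(fetch, execute) + execute
    # Every wrong pre-fetch forces a repeated fetch with the right address.
    return overlapped + wrong_prefetches * fetch


print(micro_routine_time(100, fetch=1, execute=1))                 # 200
print(micro_routine_time(100, fetch=1, execute=1, prefetch=True))  # 101
print(micro_routine_time(100, fetch=1, execute=1, prefetch=True,
                         wrong_prefetches=10))                     # 111
```

Under these assumptions, pre-fetching roughly halves the running time when fetch and execute times are equal, and occasional wrong pre-fetches erode only a small part of that gain.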

Fig 25


Fig 26
