BINARY NUMBERS 1
Chapter 0 Fundamentals
The primary goal of numerical analysis is to construct and explore
algorithms for solving problems in science and engineering. These algorithms
can be implemented in programming languages and carried out on computers.
To understand algorithms well, we should know how the basic arithmetic
operations (addition, subtraction, multiplication and division) are
performed on computers. This is a basic issue that deserves attention when
designing and programming algorithms. In particular, knowing the details of
computer arithmetic puts us in a better position to understand potential
pitfalls in computation. Therefore, in this chapter we first introduce
computer arithmetic: how numbers are stored and operated on in a computer,
how large the machine error is, and how to avoid loss of significance.
Some results from calculus are also reviewed in this chapter, since they
will be used in the construction and analysis of algorithms.
(b) Adding up the digits after the binary point times the negative powers
of 2 leads to
\[ -(0.11)_2 = -(1 \times 2^{-1} + 1 \times 2^{-2}) = -\frac{3}{4}; \]
(c) Noting that
\[ (100101)_2 = 1 \times 2^0 + 0 \times 2^1 + 1 \times 2^2 + 0 \times 2^3 + 0 \times 2^4 + 1 \times 2^5 = 37, \]
\[ (0.1011)_2 = 1 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3} + 1 \times 2^{-4} = \frac{11}{16}, \]
we obtain
\[ (100101.1011)_2 = (100101)_2 + (0.1011)_2 = 37 + \frac{11}{16} = 37\frac{11}{16}. \]
□
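The positional expansions above can be checked mechanically. The following is a minimal Python sketch, not part of the original text: the helper name binary_to_fraction is hypothetical, and exact rational arithmetic from the standard fractions module is assumed so that no rounding enters the check.

```python
from fractions import Fraction

def binary_to_fraction(bits: str) -> Fraction:
    """Evaluate a binary string such as '-0.11' or '0.1011' exactly."""
    sign = -1 if bits.startswith("-") else 1
    bits = bits.lstrip("+-")
    int_part, _, frac_part = bits.partition(".")
    value = Fraction(0)
    # Integer digits carry weights 2^0, 2^1, ... counted leftward from the point.
    for k, d in enumerate(reversed(int_part)):
        value += int(d) * Fraction(2) ** k
    # Fractional digits carry weights 2^-1, 2^-2, ... counted rightward.
    for k, d in enumerate(frac_part, start=1):
        value += int(d) * Fraction(1, 2 ** k)
    return sign * value

print(binary_to_fraction("-0.11"))    # -3/4
print(binary_to_fraction("0.1011"))   # 11/16
```

The integer and fractional parts are summed separately, mirroring the way the worked example splits the number at the binary point.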
0.7 × 2 = 0.4 + 1
0.4 × 2 = 0.8 + 0
0.8 × 2 = 0.6 + 1
0.6 × 2 = 0.2 + 1
0.2 × 2 = 0.4 + 0
0.4 × 2 = 0.8 + 0
        ⋮
When the new fractional part becomes 0, the process stops; otherwise it
continues. Writing the integer parts (the digits on the right) from top to
bottom gives the binary fraction $0.7 = (0.1\overline{0110})_2$. The bar
over 0110 means that these four digits repeat forever.
Similarly, we have $0.625 = (0.101)_2$. □
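The repeated-doubling procedure can be sketched in a few lines of Python. This is an illustration only: the function name fraction_to_binary and the convention of writing the repeating block in parentheses are assumptions made here. The input is an exact Fraction, so the already rounded float 0.7 is never used.

```python
from fractions import Fraction

def fraction_to_binary(x: Fraction) -> str:
    """Binary expansion of 0 <= x < 1; a repeating block is shown in parentheses."""
    digits = []
    seen = {}                 # fractional part -> index of the digit it produces
    while x != 0 and x not in seen:
        seen[x] = len(digits)
        x *= 2
        bit = int(x >= 1)     # integer part after doubling
        digits.append(str(bit))
        x -= bit              # keep only the new fractional part
    if x == 0:                # the process stopped: finite expansion
        return "0." + "".join(digits)
    start = seen[x]           # the fractional part recurred: digits repeat from here
    return "0." + "".join(digits[:start]) + "(" + "".join(digits[start:]) + ")"

print(fraction_to_binary(Fraction(7, 10)))   # 0.1(0110)
print(fraction_to_binary(Fraction(5, 8)))    # 0.101
```

Recording each fractional part is what detects the repetition: as soon as a fractional part reappears, every later digit must repeat as well.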
Example 0.4. Convert the number 53.7 to binary.
Solution. Combining Examples 0.2 and 0.3 yields
\[ 53.7 = (110101.1\overline{0110})_2, \]
which is a repeating binary number. The bar over the digits means that the
four digits repeat infinitely. □
Example 0.5. Suppose $x = (0.\overline{1011})_2$. Convert it to decimal.
Solution. Since $x = (0.\overline{1011})_2$ and
$2^4 x = (1011.\overline{1011})_2$, subtracting $x$ from $2^4 x$ yields
$(2^4 - 1)x = (1011)_2 = (11)_{10}$. Thus
\[ x = (0.\overline{1011})_2 = \frac{11}{15}. \]
□
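The shift-and-subtract argument generalizes to any repeating block of k bits, since $(2^k - 1)x$ equals the block read as an integer. A minimal sketch, with the hypothetical helper name repeating_binary_to_fraction:

```python
from fractions import Fraction

def repeating_binary_to_fraction(block: str) -> Fraction:
    """Value of (0.block block block ...)_2 for a repeating bit block."""
    k = len(block)
    # 2^k * x - x = (block)_2, hence x = (block)_2 / (2^k - 1).
    return Fraction(int(block, 2), 2 ** k - 1)

print(repeating_binary_to_fraction("1011"))   # 11/15

# A non-repeating prefix is handled by scaling: (0.1 0110 0110 ...)_2
half = Fraction(1, 2)
print(half + half * repeating_binary_to_fraction("0110"))   # 7/10
```

The second line recovers 0.7 from its repeating expansion, which is a convenient consistency check on the conversion in the other direction.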
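\[
\begin{aligned}
(100.0)_2 &= (1.00)_2 \times 2^2; \\
-(0.11)_2 &= -(1.1)_2 \times 2^{-1}; \\
(100101.1011)_2 &= (1.001011011)_2 \times 2^5; \\
(110101.1\overline{0110})_2 &= (1.101011\overline{0110})_2 \times 2^5.
\end{aligned}
\]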
As shown above, some numbers have finitely many binary digits, but many
have infinitely many. Numbers with infinitely many binary digits are stored
on computers as binary numbers with a fixed number of binary digits.
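For example, on computers using the 32-bit format (called single precision),
the binary number $r = (1.101011\overline{0110})_2 \times 2^5$ is stored as
\[ r' = (1.10101101100110011001101)_2 \times 2^5, \]
where 23 bits are kept to the right of the binary point. The rounding rule
is: if the 24th bit to the right of the binary point is 0, round down
(truncate after the 23rd bit); if the 24th bit is 1, round up (add 1 to the
23rd bit). In the exact tie, where the 24th bit is 1 and all later bits are
0, the IEEE default rounds so that the 23rd bit becomes 0. Here the 24th bit
of $r$ is 1 and nonzero bits follow, so the 23rd bit is rounded up from 0
to 1.
The binary number $r'$ is called the binary floating-point representation
of $r$, where 1.10101101100110011001101 is called the mantissa and 5 is the
exponent. The number $r'$ approximates $r$ and agrees with it in the first
22 bits after the binary point.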
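The stored value $r'$ can be reproduced with Python's standard struct module, whose "f" format with an explicit byte order packs a number as an IEEE 754 32-bit value. This is a sketch under stated assumptions: the helper name to_single is hypothetical, and Python first rounds 53.7 to double precision, which does not change the single-precision result for this number.

```python
import struct
from fractions import Fraction

def to_single(x: float) -> float:
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

r = Fraction("53.7")                    # the exact real number 53.7
r_single = Fraction(to_single(53.7))    # exact value of the stored number r'

# r' = 1.bbb... x 2^5, so dividing by 2^5 and scaling by 2^23 turns the
# leading 1 and the 23 fraction bits into one 24-bit integer.
mantissa = r_single / 2 ** 5 * 2 ** 23
print(format(int(mantissa), "b"))       # 110101101100110011001101
print(float(abs(r_single - r)))         # rounding error, at most 2^-19
```

The printed bit string is the leading 1 followed by the 23 bits after the binary point, so it can be compared digit by digit with the mantissa of $r'$.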
Representations of the form $r'$ were standardized by the Institute of
Electrical and Electronics Engineers. The standard is called the IEEE 754
Floating-Point Standard, and it consists of a set of binary representations
of real numbers.
Definition 0.2. A floating point number has three parts: the sign (+ or
-), a mantissa, which contains the string of significant bits, and an
exponent. The form of a normalized floating point number is
$\pm 1.bbb\ldots \times 2^p$, where each $b$ is 0 or 1 and $p$ is an
$M$-bit binary number representing the exponent.
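The three parts of a floating point number can be read off from the bits a computer actually stores. The sketch below is an illustration, not the text's own method: the helper name single_fields is hypothetical, and it relies on two facts about the IEEE 754 single-precision format that are not stated above, namely that the word is laid out as 1 sign bit, 8 exponent bits and 23 mantissa bits, and that the exponent field holds the biased value p + 127.

```python
import struct

def single_fields(x: float):
    """Split the 32-bit IEEE 754 single-precision word of x into its three fields."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    bits = format(word, "032b")
    return bits[0], bits[1:9], bits[9:]   # sign, biased exponent, mantissa

sign, exponent, mantissa = single_fields(53.7)
print(sign, exponent, mantissa)   # 0 10000100 10101101100110011001101
print(int(exponent, 2) - 127)     # 5, the unbiased exponent p
```

The leading 1 of the normalized form is not stored; only the 23 bits after the binary point appear in the mantissa field.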
Remark 0.2. (a) Normalization means that the leading (leftmost) bit must
be the digit 1;
(b) A floating-point number is a rational number, because it has a finite
number of digits and can be represented as one integer divided by another.
For example, $(1.1011)_2 \times 2^2$ is $(110.11)_2$, which is equal to
\[ 1 \times 2^2 + 1 \times 2^1 + 1 \times 2^{-1} + 1 \times 2^{-2} = 6 + \frac{3}{4} = \frac{27}{4}. \]
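(c) In a normalized floating point number, the sign, exponent and mantissa
are stored together in a computer word:
    | sign | exponent | mantissa |
For example, the IEEE floating-point representation of the real number
$r = 53.7$ in Example 0.4 in single precision is
\[ r' = (1.101011\,0110\,0110\,0110\,0110\,1)_2 \times 2^5. \]
The word for storing $r'$ is
    | 0 | 00000101 | 10101101100110011001101 |
Here the exponent field shows $p = 5$ written directly in 8 bits; the IEEE
754 standard itself stores the biased value $p + 127 = 132 = (10000100)_2$
in this field.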
(d) The lengths of the mantissa and exponent determine the precision to
which numbers can be represented. There are three commonly used levels of
precision for floating point numbers: single precision, double precision,
and long double precision. The total numbers of bits allocated for the
three levels are 32, 64 and 80, respectively. The numbers of bits used for
each part in the three levels of precision are shown below.

    precision      sign   exponent   mantissa
    single           1        8         23
    double           1       11         52
    long double      1       15         64
□
Adding two "1" digits produces the digit
"0", while 1 will have to be added to the next column. This is similar
to what happens in decimal when certain single-digit numbers are added
together; if the result equals or exceeds the value of the radix (10), the digit
to the left is incremented. This is known as carrying. When the result of an
addition exceeds the value of a digit, the procedure is to "carry" the excess
amount divided by the radix (that is, 10/10) to the left, adding it to the next
positional value.
Subtraction works in much the same way as addition: 0−0 → 0; 0−1 → 1,
borrow 1; 1 − 0 → 1; 1 − 1 → 0. Subtracting a "1" digit from a "0"
digit produces the digit "1", while 1 will have to be subtracted from the
next column. This is known as borrowing. The principle is the same as
for carrying. When 0 − 1 happens, the procedure is to "borrow" the deficit
divided by the radix from the left, subtracting it from the next positional
value. For example,
        1 1 0 1 1 1 0
    −       1 0 1 1 1
    -----------------
    =   1 0 1 0 1 1 1
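The borrowing rule can be written out column by column. A minimal sketch, assuming binary strings without a binary point and a first operand that is at least as large as the second; the function name subtract_binary is hypothetical.

```python
def subtract_binary(a: str, b: str) -> str:
    """Column-wise a - b for binary strings with a >= b, using explicit borrows."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    borrow = 0
    out = []
    # Work from the rightmost column to the leftmost.
    for da, db in zip(reversed(a), reversed(b)):
        diff = int(da) - int(db) - borrow
        borrow = 1 if diff < 0 else 0       # borrow 1 from the next column
        out.append(str(diff + 2 * borrow))  # the borrowed 1 is worth 2 here
    return "".join(reversed(out)).lstrip("0") or "0"

print(subtract_binary("1101110", "10111"))  # 1010111
```

In decimal the same computation reads 110 - 23 = 87, which is a quick way to confirm the result.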
Binary numbers with bits after the binary point can be multiplied in the
same way; in the product, the number of digits to the right of the binary
point equals the total number of fractional digits in the two factors.
For example, the multiplication $(101.101)_2 \times (110.01)_2$ is carried
out as follows:

         101.101
    ×    110.01
    ------------------
           1.01101
          00.0000
         000.000
        1011.01
    +  10110.1
    ------------------
    = 100011.00101
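One way to check such a product is to multiply the digit strings as integers and then place the binary point. The sketch below assumes unsigned inputs; the helper names parse_binary and multiply_binary are hypothetical.

```python
from fractions import Fraction

def parse_binary(bits: str) -> Fraction:
    """Exact value of an unsigned binary string with an optional binary point."""
    int_part, _, frac_part = bits.partition(".")
    return Fraction(int((int_part + frac_part) or "0", 2), 2 ** len(frac_part))

def multiply_binary(a: str, b: str) -> str:
    """Multiply as integers, then place the point: fractional digit counts add up."""
    total = len(a.partition(".")[2]) + len(b.partition(".")[2])
    product = int(a.replace(".", ""), 2) * int(b.replace(".", ""), 2)
    digits = format(product, "b").zfill(total + 1)
    if total == 0:
        return digits
    return digits[:-total] + "." + digits[-total:]

p = multiply_binary("101.101", "110.01")
print(p)                  # 100011.00101
print(parse_binary(p))    # 1125/32
```

In decimal the factors are 5.625 and 6.25, and 1125/32 = 35.15625 is their product.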
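\[
\begin{aligned}
1 + 2^{-24} &= 1.00000000000000000000000 \times 2^0 \\
&\quad + 0.00000000000000000000000\,1 \times 2^0 \\
&= 1.00000000000000000000000\,1 \times 2^0 .
\end{aligned}
\]
The sum has 24 bits after the binary point, one more than single precision
can hold, so it is saved as $1.0 \times 2^0 = 1$. This example shows that
when a very small number is added to a large one, the stored result can be
identical to the large number.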
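This absorption effect can be observed directly. Python floats are double precision, so the sketch below emulates single-precision storage by a round trip through the standard struct module; the helper name to_single is hypothetical.

```python
import struct

def to_single(x: float) -> float:
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

big = 1.0
tiny = 2.0 ** -24     # one bit beyond the 23-bit mantissa of 1.0

print(to_single(big + tiny) == big)         # True: the small term is lost
print(to_single(big + 2.0 ** -23) == big)   # False: 2^-23 still fits in the mantissa
```

The sums are formed exactly in double precision first, so the only rounding that occurs is the conversion to single precision.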
Definition 0.6 (loss of significance). When two nearly equal numbers
are subtracted, significant digits are lost. This phenomenon is called loss
of significance.
For example, we use seven significant digits to do the subtraction
113.4567 − 113.4566:
1 1 3 . 4 5 6 7
− 1 1 3 . 4 5 6 6
− − − − − − − − −
= 0 0 0 . 0 0 0 1
The two input numbers have seven-digit accuracy, but after subtraction the
result has only one-digit accuracy: six significant digits are lost. In
programming and computation, loss of significance should be avoided by
restructuring the calculation and reducing the operation count.
(a) $\sqrt{9.01} - 3 = \dfrac{\sqrt{9.01} - 3}{\sqrt{9.01} + 3}\,(\sqrt{9.01} + 3) = \dfrac{0.01}{\sqrt{9.01} + 3}$;
(b) $1 - \cos(0.001) = 1 - (1 - 2\sin^2(0.0005)) = 2\sin^2(0.0005)$;
(c) $\dfrac{1 - \cos(x)}{\sin^2(x)} = \dfrac{1 - \cos^2(x)}{(1 + \cos(x))\sin^2(x)} = \dfrac{1}{1 + \cos(x)}$.
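The effect of restructuring in (a) and (b) can be measured numerically. The sketch below is an illustration under stated assumptions: the reference values come from 50-digit arithmetic in the standard decimal module (for (b), from the Taylor series of 1 - cos x), and the helper name rel_err is hypothetical. No specific error values are claimed, since they depend on the platform's math library.

```python
import math
from decimal import Decimal, getcontext

getcontext().prec = 50   # 50-digit reference arithmetic

def rel_err(approx: float, exact: Decimal) -> float:
    """Relative error of a double-precision result against a reference value."""
    return float(abs(Decimal(approx) - exact) / exact)

# (a) sqrt(9.01) - 3: direct subtraction versus the conjugate form
ref_a = Decimal("9.01").sqrt() - 3
naive_a = math.sqrt(9.01) - 3
stable_a = 0.01 / (math.sqrt(9.01) + 3)

# (b) 1 - cos(0.001); reference from the series x^2/2! - x^4/4! + ...
x = Decimal("0.001")
ref_b = sum((-1) ** (k + 1) * x ** (2 * k) / math.factorial(2 * k)
            for k in range(1, 8))
naive_b = 1 - math.cos(0.001)
stable_b = 2 * math.sin(0.0005) ** 2

for name, naive, stable, ref in [("(a)", naive_a, stable_a, ref_a),
                                 ("(b)", naive_b, stable_b, ref_b)]:
    print(name, "direct:", rel_err(naive, ref),
          "restructured:", rel_err(stable, ref))
```

The restructured forms should stay near the double-precision rounding level, while the direct forms can lose several digits because the subtraction magnifies the rounding error of its operands.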
Example 0.9. Give the roots of the equation $x^2 + 9^{12}x - 3 = 0$, paying
attention to loss of significance.
Solution. By the quadratic formula, the solutions $x_1$ and $x_2$ are
\[ x_1 = \frac{-9^{12} - \sqrt{9^{24} + 12}}{2}, \quad \text{and} \quad x_2 = \frac{-9^{12} + \sqrt{9^{24} + 12}}{2}. \]
Note that $9^{12}$ and $\sqrt{9^{24} + 12}$ are nearly equal to each other,
so computing $-9^{12} + \sqrt{9^{24} + 12}$ on a computer leads to loss of
significance. Multiplying the numerator and denominator of $x_2$ by
$-9^{12} - \sqrt{9^{24} + 12}$, we have
\[ x_2 = \frac{6}{9^{12} + \sqrt{9^{24} + 12}}. \]
Thus the formulas used by computers or developed into programs are:
\[ x_1 = \frac{-9^{12} - \sqrt{9^{24} + 12}}{2}, \quad \text{and} \quad x_2 = \frac{6}{9^{12} + \sqrt{9^{24} + 12}}. \]
□
This example shows that the quadratic formula
$x_{1,2} = (-b \pm \sqrt{b^2 - 4ac})/(2a)$ for the equation
$ax^2 + bx + c = 0$ must be used with care in cases where $|b|$ is very
close to $\sqrt{b^2 - 4ac}$, that is, when $a$ and/or $c$ are very small
compared with $b$. In this case $b^2 - 4ac$ is nearly equal to $b^2$, and
one of the roots in the traditional expressions is subject to loss of
significance. If $b$ is positive in this situation, the roots should be
computed by
\[ x_1 = \frac{-b - \sqrt{b^2 - 4ac}}{2a}, \quad \text{and} \quad x_2 = \frac{-2c}{b + \sqrt{b^2 - 4ac}}; \]
and if $b$ is negative and $b^2 - 4ac$ is very close to $b^2$, then the
roots are best computed by
\[ x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a}, \quad \text{and} \quad x_2 = \frac{2c}{-b + \sqrt{b^2 - 4ac}}. \]
If we use the quadratic formula to develop a program to solve a quadratic
equation, the above expressions should be considered and used.
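A program along these lines might look as follows. This is a sketch rather than a definitive implementation: the function name quadratic_roots is hypothetical, and it assumes real roots (a nonnegative discriminant), a nonzero a, and that b and c are not both zero.

```python
import math

def quadratic_roots(a: float, b: float, c: float):
    """Real roots of a*x^2 + b*x + c = 0 without subtracting nearly equal numbers."""
    root = math.sqrt(b * b - 4 * a * c)
    if b >= 0:
        x1 = (-b - root) / (2 * a)     # -b and -root have the same sign
        x2 = -2 * c / (b + root)
    else:
        x1 = (-b + root) / (2 * a)     # -b and +root have the same sign
        x2 = 2 * c / (-b + root)
    return x1, x2

b = 9.0 ** 12
print(quadratic_roots(1.0, b, -3.0))
print((-b + math.sqrt(b * b + 12)) / 2)   # traditional formula for x2, for comparison
```

In both branches the square root is only ever added to a quantity of the same sign, so no cancellation occurs; the second print shows what the traditional formula gives for the small root of the equation in Example 0.9.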
Example 0.10. Consider the number of additions and multiplications
required to evaluate a polynomial $p(x)$ at $x = 0.5$ by the standard form
$p(x) = c_0 + c_1 x + c_2 x^2 + c_3 x^3$ and by the nested form
$p(x) = c_0 + x[c_1 + x(c_2 + c_3 x)]$ (called nested multiplication).
Solution. Counting the operations in the calculation
$p(0.5) = c_0 + c_1 \cdot 0.5 + c_2 \cdot 0.5^2 + c_3 \cdot 0.5^3$, we need
3 additions and 6 multiplications, while if we use
$p(0.5) = c_0 + 0.5 \cdot [c_1 + 0.5 \cdot (c_2 + c_3 \cdot 0.5)]$ and
evaluate from the inside out, 3 additions and 3 multiplications are
required. □
This example shows that when a polynomial, especially one of high degree,
is evaluated, the nested form saves multiplications and, by reducing the
operation count, helps to limit loss of significance.
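Nested multiplication extends to any degree with a single loop. A minimal sketch follows; the function names horner and standard_form and the sample coefficients are hypothetical, chosen only for illustration.

```python
def horner(coeffs, x):
    """Evaluate c0 + c1*x + ... + cn*x^n by nested multiplication.

    coeffs is [c0, c1, ..., cn]; uses n multiplications and n additions.
    """
    result = coeffs[-1]
    for c in reversed(coeffs[:-1]):
        result = result * x + c
    return result

def standard_form(coeffs, x):
    """Evaluate term by term, forming each power by repeated multiplication."""
    total = coeffs[0]
    for k, c in enumerate(coeffs[1:], start=1):
        term = c
        for _ in range(k):       # k multiplications for c_k * x^k
            term *= x
        total += term
    return total

coeffs = [1.0, 2.0, 3.0, 4.0]    # hypothetical c0, c1, c2, c3
print(horner(coeffs, 0.5), standard_form(coeffs, 0.5))   # 3.25 3.25
```

For degree n the standard form as written costs n(n+1)/2 multiplications against n for the nested form, which matches the counts 6 and 3 for the cubic in Example 0.10.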
Exercise
1. Find the binary representation of the decimal numbers:
(a) 8; (b) 64; (c) 17; (d) 1/8; (e) 35/16.
2. Convert the following binary numbers to decimal numbers:
(a) $(10111)_2$; (b) $(0.1001)_2$; (c) $(1101.101)_2$.
3. Convert the following base 10 numbers to binary and express each as
a floating point number in single precision by using the Rounding to Nearest
Rule:
(a) 37; (b) 1/4; (c) 1/3; (d) 9.5; (e) 25.75.
4. (a) Suppose $x = (0.\overline{1001})_2$. Convert this binary number to
decimal; (b) use the result in (a) to convert $(1.1\overline{1001})_2$ to a
decimal number.
5. Identify for which values of $x$ there is subtraction of nearly equal
numbers, and find an alternate form that avoids the problem.
(a) $\dfrac{1 - \sec x}{\tan^2 x}$; (b) $\dfrac{1 - (1 - x)^3}{x}$; (c) $\dfrac{1}{1 + x} - \dfrac{1}{1 - x}$.
6. Explain how to most accurately compute the two roots of the equation
$x^2 + bx - 10^{-12} = 0$, where $b$ is a number greater than 100.