Numerical Analysis

Chapter 0 Fundamentals
The primary goal of numerical analysis is to construct and explore
algorithms for solving science and engineering problems. These algorithms
can be implemented in programming languages and carried out on computers.
To understand algorithms well, we should know how the arithmetic
operations (addition, subtraction, multiplication and division) are
performed on computers. This is a basic issue that deserves attention when
designing and programming algorithms. In particular, knowing the details of
computer arithmetic puts us in a better position to understand potential
pitfalls in computation. Therefore, in this chapter we first introduce
computer arithmetic: how numbers are stored and operated on in a computer,
how large the machine error is, and how to avoid loss of significance.
Some results from calculus are also reviewed in this chapter, since they
will be used in the construction and analysis of algorithms.
0.1 Binary numbers

Binary numbers are used on computers. Before we introduce actual forms
of numbers and number operations on computers, we give the standard form
of a binary number and conversion between decimal numbers and binary
numbers.
A decimal number, for example 1234.56, can be written as
$$1234.56 = 1\times10^{3} + 2\times10^{2} + 3\times10^{1} + 4\times10^{0} + 5\times10^{-1} + 6\times10^{-2} = 1000 + 200 + 30 + 4 + 0.5 + 0.06,$$
where the digits 0, 1, 2, ..., 9 are used to express decimal numbers.
Binary numbers are defined similarly.
Definition 0.1. Binary numbers are expressed as $(\cdots b_2 b_1 b_0 . b_{-1} b_{-2} \cdots)_2$,
where each binary digit $b_i$ is 0 or 1. The base 10 equivalent of this number
is
$$\cdots + b_2 2^{2} + b_1 2^{1} + b_0 2^{0} + b_{-1} 2^{-1} + b_{-2} 2^{-2} + \cdots.$$
Example 0.1. Convert the binary numbers (a) $(100.0)_2$; (b) $-(0.11)_2$;
and (c) $(100101.1011)_2$ into decimal numbers.
Solution. (a) Adding up the digits times powers of 2 leads to
$$(100.0)_2 = 0\times2^{-1} + 0\times2^{0} + 0\times2^{1} + 1\times2^{2} = 4;$$
(b) Adding up the digits after the point times the negative powers of 2
leads to
$$-(0.11)_2 = -(1\times2^{-1} + 1\times2^{-2}) = -\frac{3}{4};$$
(c) Noting that
$$(100101)_2 = 1\times2^{0} + 0\times2^{1} + 1\times2^{2} + 0\times2^{3} + 0\times2^{4} + 1\times2^{5} = 37,$$
$$(0.1011)_2 = 1\times2^{-1} + 0\times2^{-2} + 1\times2^{-3} + 1\times2^{-4} = \frac{11}{16},$$
we obtain
$$(100101.1011)_2 = (100101)_2 + (0.1011)_2 = 37 + \frac{11}{16} = 37\tfrac{11}{16}. \qquad \Box$$
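The digit-times-power sum of Definition 0.1 can be sketched in Python as follows. The function name `binary_to_decimal` and its string input format are assumptions made for illustration; exact rational arithmetic is used so that no rounding enters the result.

```python
from fractions import Fraction


def binary_to_decimal(bits: str) -> Fraction:
    """Evaluate a binary string such as '-0.11' or '100.0' exactly."""
    sign = -1 if bits.startswith("-") else 1
    bits = bits.lstrip("+-")
    int_part, _, frac_part = bits.partition(".")
    value = Fraction(0)
    # Integer digits: b_k * 2^k, with k counted from the rightmost digit.
    for k, b in enumerate(reversed(int_part)):
        value += int(b) * Fraction(2) ** k
    # Fractional digits: b_{-k} * 2^{-k}, with k counted from the point.
    for k, b in enumerate(frac_part, start=1):
        value += int(b) * Fraction(1, 2 ** k)
    return sign * value


print(binary_to_decimal("-0.11"))   # -3/4
print(binary_to_decimal("0.1011"))  # 11/16
```

Returning a `Fraction` rather than a `float` keeps results such as 11/16 exact, which makes the output easy to compare with hand calculations.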
Theorem 0.1. A decimal integer is converted to binary by dividing it by 2
successively and recording the remainders; the binary digits are the
remainders read from the bottom (last) to the top (first).
Example 0.2. Convert the decimal number 53 into a binary number.
Solution. Divide 53 by 2 successively and record the remainders (R) as
follows:
    53 ÷ 2 = 26 R 1
    26 ÷ 2 = 13 R 0
    13 ÷ 2 =  6 R 1
     6 ÷ 2 =  3 R 0
     3 ÷ 2 =  1 R 1
     1 ÷ 2 =  0 R 1
When the quotient becomes 0, the process stops, since the additional
equation 0 ÷ 2 = 0 R 0 is trivial. The binary number is then obtained by
writing the remainders (binary digits) from the bottom to the top:
$53 = (110101)_2$. □
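The repeated-division procedure of Theorem 0.1 can be sketched as a short Python function. The name `int_to_binary` is hypothetical; the loop mirrors the table of quotients and remainders.

```python
def int_to_binary(n: int) -> str:
    """Convert a nonnegative integer to binary by repeated division by 2."""
    if n < 0:
        raise ValueError("n must be nonnegative")
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        n, r = divmod(n, 2)       # quotient and remainder of n / 2
        remainders.append(str(r))
    # Remainders are read from the last one produced to the first.
    return "".join(reversed(remainders))


print(int_to_binary(53))  # 110101
```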
Theorem 0.2. A decimal fraction is converted to binary by multiplying the
fraction, and then each resulting fractional part, by 2 successively and
recording the integer parts from the top to the bottom.
Example 0.3. Convert the decimal fractions 0.7 and 0.625 to binary
numbers.
Solution. Multiplying 0.7 by 2 gives the integer part 1 and the fraction
0.4. Record the integer 1; multiplying the resulting fraction 0.4 by 2 gives
the integer part 0 and the fraction 0.8. Continuing this process produces a
sequence of binary digits:
    0.7 × 2 = 0.4 + 1
    0.4 × 2 = 0.8 + 0
    0.8 × 2 = 0.6 + 1
    0.6 × 2 = 0.2 + 1
    0.2 × 2 = 0.4 + 0
    0.4 × 2 = 0.8 + 0
    ...
The process stops when the new fractional part becomes 0; otherwise it goes
on. Here the fractional part 0.4 recurs, so the digits repeat. Writing the
integer parts from top to bottom gives the binary fraction
$0.7 = (0.1\overline{0110})_2$, where the bar over 0110 means that these four
digits repeat forever.
Similarly, we have $0.625 = (0.101)_2$. □
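The multiply-by-2 procedure of Theorem 0.2 can be sketched as below. The function name `fraction_to_binary` and the choice to detect the repeating block by remembering earlier fractional parts are assumptions of this sketch, not part of the theorem. Exact fractions are used so that a recurring fractional part is recognised reliably.

```python
from fractions import Fraction


def fraction_to_binary(x: Fraction) -> tuple[str, str]:
    """Return (non-repeating digits, repeating block) of 0 <= x < 1 in base 2."""
    if not 0 <= x < 1:
        raise ValueError("x must satisfy 0 <= x < 1")
    digits = []
    seen = {}  # fractional part -> position of the digit it produces
    while x != 0 and x not in seen:
        seen[x] = len(digits)
        x *= 2
        bit = int(x)          # integer part of the product, 0 or 1
        digits.append(str(bit))
        x -= bit              # keep only the new fractional part
    if x == 0:                # terminating expansion
        return "".join(digits), ""
    start = seen[x]           # the expansion repeats from this position on
    return "".join(digits[:start]), "".join(digits[start:])


print(fraction_to_binary(Fraction(7, 10)))  # ('1', '0110')
print(fraction_to_binary(Fraction(5, 8)))   # ('101', '')
```

The first component holds the digits before the repeating block and the second holds the block itself, so `('1', '0110')` stands for 0.1 followed by 0110 repeated forever.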
Example 0.4. Convert the number 53.7 to a binary number.
Solution. Combining Examples 0.2 and 0.3 yields
$$53.7 = 53 + 0.7 = (110101)_2 + (0.1\overline{0110})_2 = (110101.1\,0110\,0110\cdots)_2 = (110101.1\overline{0110})_2,$$
which is a repeating binary number. The bar over the digits means that the
four digits are repeated infinitely. □
Example 0.5. Suppose $x = (0.\overline{1011})_2$. Convert it to decimal.
Solution. Since $x = (0.\overline{1011})_2$ and $2^{4}x = (1011.\overline{1011})_2$,
subtracting $x$ from $2^{4}x$ yields $(2^{4} - 1)x = (1011)_2 = (11)_{10} = 11$.
Thus $x = (0.\overline{1011})_2 = \frac{11}{15}$. □
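The shift-and-subtract argument generalises: a block of k bits repeating immediately after the binary point equals the block's integer value divided by 2^k − 1. A minimal sketch follows, assuming a hypothetical helper `repeating_binary_to_fraction` that takes the non-repeating digits and the repeating block as strings.

```python
from fractions import Fraction


def repeating_binary_to_fraction(prefix: str, block: str) -> Fraction:
    """Value of (0.prefix block block block ...)_2 as an exact fraction."""
    p, k = len(prefix), len(block)
    # Finite part: the prefix digits read as an integer, scaled by 2^-p.
    finite = Fraction(int(prefix, 2), 2 ** p) if prefix else Fraction(0)
    # Repeating part: (2^k - 1) * y = block, so y = block / (2^k - 1).
    repeating = Fraction(int(block, 2), 2 ** k - 1) if block else Fraction(0)
    # The repeating part starts p places after the point.
    return finite + repeating / 2 ** p


print(repeating_binary_to_fraction("", "1011"))   # 11/15
print(repeating_binary_to_fraction("1", "0110"))  # 7/10
```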
0.2 Floating-point numbers and round-off errors

Numbers are stored on computers in binary form. In this section we
introduce the forms in which binary numbers are stored on computers and the
round-off errors between numbers and the values stored for them.
First we review scientific notation for decimal numbers.
Remark 0.1. Scientific notation (also referred to as "standard form" or
"standard index form") is a way of writing numbers that are too big or too
small to be conveniently written in decimal form.
In scientific notation all numbers are written in the form $a \times 10^{b}$
(a times ten raised to the power of b), where the exponent b is an integer,
and the coefficient a is any real number, called the significand or mantissa.
If the number is negative then a minus sign precedes a (as in ordinary
decimal notation).
In normalized scientific notation, the exponent b is chosen so that the
absolute value of a is at least one but less than ten ($1 \le |a| < 10$). For
example,
    Decimal notation      Scientific notation
    2                     $2 \times 10^{0}$
    300                   $3 \times 10^{2}$
    4,321.768             $4.321768 \times 10^{3}$
    −53000                $-5.3 \times 10^{4}$
    6,720,000,000         $6.72 \times 10^{9}$
    0.2                   $2 \times 10^{-1}$
    0.00000000751         $7.51 \times 10^{-9}$
□
As the above examples show, scientific notation allows easy comparison of
decimal numbers, since the exponent b gives the number's order of magnitude.
In normalized notation, the exponent b is negative for a number with
absolute value between 0 and 1 (e.g. 0.5 is written as $5 \times 10^{-1}$).
The 10 and the exponent are often omitted when the exponent is 0.
Similar to scientific notation for decimal numbers, the binary numbers
in Examples 0.1 and 0.4 can be rewritten as
$$(100.0)_2 = (1.00)_2 \times 2^{2};$$
$$-(0.11)_2 = -(1.1)_2 \times 2^{-1};$$
$$(100101.1011)_2 = (1.001011011)_2 \times 2^{5};$$
$$(110101.1\overline{0110})_2 = (1.101011\overline{0110})_2 \times 2^{5}.$$
In these transformations, m is added to the power of 2 if the binary point
is moved m places to the left, and m is subtracted from the power of 2 if
the binary point is moved m places to the right.
As shown above, some binary numbers have finitely many binary digits, but
many have infinitely many. Binary numbers with infinitely many binary digits
are stored on computers as binary numbers with a fixed number of binary
digits.
For example, on computers with 32-bit accuracy (called single precision),
the binary number $r = (1.101011\overline{0110})_2 \times 2^{5}$ is stored as
$$r' = (1.10101101100110011001101)_2 \times 2^{5},$$
where 23 digits are kept to the right of the binary point. The rounding rule
is: if the 24th bit to the right of the binary point is 0, round down
(truncate after the 23rd bit); if the 24th bit is 1, round up (add 1 to the
23rd bit). In the exceptional tie case, where the 24th bit is 1 and all later
bits are 0, IEEE 754 rounds so that the 23rd bit is 0. For r the 24th bit is
1 and later bits are nonzero, so r′ is obtained by rounding up.
The binary number r′ is called the binary floating-point representation of
r, where 1.10101101100110011001101 is called the mantissa and 5 is the
exponent. The number r′ approximates r and agrees with r in the first 22
digits after the binary point.
The form of representation like r′ was established by the Institute of
Electrical and Electronics Engineers. The standard is called the IEEE 754
Floating Point Standard, which consists of a set of binary representations
of real numbers.
Definition 0.2. A floating point number has three parts: the sign (+ or
−), a mantissa, which contains the string of significant bits, and an
exponent. The form of a normalized floating point number is
$\pm 1.bbb\ldots \times 2^{p}$, where each b is 0 or 1 and p is an M-bit binary
number representing the exponent.
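Remark 0.2. (a) Normalization means that the leading or leftmost bit
must be the digit 1;
(b) A floating-point number is a rational number, because it has a finite
number of digits and can be represented as one integer divided by another.
For example, $(1.1011)_2 \times 2^{2}$ is $(110.11)_2$, which is equal to
$$1\times2^{2} + 1\times2^{1} + 1\times2^{-1} + 1\times2^{-2} = 6 + \frac{3}{4} = \frac{27}{4}.$$
(c) In a normalized floating point number, the sign, exponent and mantissa
are stored together in a computer word, in the order
| sign | exponent | mantissa |. The leading 1 of the mantissa is not stored.
For example, the IEEE floating-point representation of the real number
r = 53.7 in Example 0.4 in single precision is
$$r' = (1.101011\,0110\,0110\,0110\,0110\,1)_2 \times 2^{5}.$$
With the exponent 5 written as the 8-bit binary number 00000101, the word
for storing r′ is
    | 0 | 00000101 | 10101101100110011001101 |
(The actual IEEE 754 single-precision format stores the exponent with a bias
of 127, so the exponent field holds $5 + 127 = 132 = (10000100)_2$.)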
(d) The lengths of the significand and exponent determine the precision
to which numbers can be represented. There are three commonly used levels
of precision for floating point numbers: single precision, double precision,
and long double precision. The numbers of bits allocated for the three
levels are 32, 64 and 80, respectively. The numbers of bits used for each
part in the three levels of precision are shown below.
    precision      sign   exponent   mantissa
    single           1        8         23
    double           1       11         52
    long double      1       15         64
□
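The single-precision layout can be inspected directly. The following is a minimal sketch using Python's standard `struct` module; the helper name `float32_fields` is hypothetical. Note that the exponent field read back from an IEEE 754 word is biased by 127, so it is not the raw exponent p itself.

```python
import struct


def float32_fields(x: float) -> tuple[str, str, str]:
    """Split the IEEE 754 single-precision encoding of x into bit strings."""
    # Pack x as a big-endian 32-bit float, then reread the bytes as an integer.
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{word:032b}"
    return bits[0], bits[1:9], bits[9:]   # 1 sign, 8 exponent, 23 mantissa bits


sign, exponent, mantissa = float32_fields(53.7)
print(sign, exponent, mantissa)
print("unbiased exponent:", int(exponent, 2) - 127)
```

For 53.7 the mantissa field contains the 23 bits after the binary point of the rounded value, and subtracting the bias from the exponent field recovers 5.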
Definition 0.3 (round-off error). Most real numbers have to be rounded
off in order to be represented as t-digit floating point numbers. The
difference between the floating point number x′ and the original number x is
called the round-off error.
Definition 0.4 (absolute and relative error). If x is a real number and
x′ is its floating-point approximation, then the difference x′ − x is called the
absolute error and the quotient (x′ − x)/x is called the relative error.
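As an illustration of Definition 0.4, the sketch below computes the absolute and relative error made when 53.7 is stored in single precision. The helper `to_float32` is an assumption of this sketch: it emulates single-precision storage by a round trip through `struct`, and exact fractions are used so that the errors themselves are not rounded.

```python
import struct
from fractions import Fraction


def to_float32(x: float) -> float:
    """Round x to the nearest single-precision value (emulated via struct)."""
    return struct.unpack("f", struct.pack("f", x))[0]


x = Fraction(537, 10)                   # the real number 53.7, exactly
x_stored = Fraction(to_float32(53.7))   # its single-precision approximation
abs_err = x_stored - x                  # absolute error x' - x
rel_err = abs_err / x                   # relative error (x' - x) / x

print(abs_err, float(abs_err))
print(float(rel_err))
```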
When arithmetic operations are applied to floating-point numbers,
additional round-off errors may occur. In the next section we introduce
arithmetic operations for binary numbers.
0.3 Operations of floating-point numbers

The four operations on binary numbers are addition, subtraction,
multiplication and division. The simplest arithmetic operation in binary is
addition. For example,
          0 1 1 0 1
        + 1 0 1 1 1
      -------------
      = 1 0 0 1 0 0
that is, 13 + 23 = 36.
Adding two single-digit binary numbers is relatively simple, using a form
of carrying: 0 + 0 → 0; 0 + 1 → 1; 1 + 0 → 1; 1 + 1 → 0, carry 1 (since
$1 + 1 = 2 = (10)_2 = 0 + 1 \times 2^{1}$). Adding two "1" digits produces a
digit "0", while 1 has to be added to the next column. This is similar
to what happens in decimal when certain single-digit numbers are added
together; if the result equals or exceeds the value of the radix (10), the
digit to the left is incremented. This is known as carrying. When the result
of an addition exceeds the value of a digit, the procedure is to "carry" the
excess amount divided by the radix (that is, 10/10) to the left, adding it to
the next positional value.
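The column-by-column rule with carrying can be sketched as follows. The function name `add_binary` is hypothetical, and the operands are assumed to be strings of 0s and 1s without a binary point.

```python
def add_binary(a: str, b: str) -> str:
    """Add two binary strings column by column, propagating the carry."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)   # pad the shorter operand with 0s
    carry = 0
    out = []
    for x, y in zip(reversed(a), reversed(b)):  # rightmost column first
        total = int(x) + int(y) + carry
        out.append(str(total % 2))   # digit written in this column
        carry = total // 2           # 1 is carried when the column sum is >= 2
    if carry:
        out.append("1")
    return "".join(reversed(out))


print(add_binary("01101", "10111"))  # 100100
```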
Subtraction works in much the same way as addition: 0 − 0 → 0; 0 − 1 → 1,
borrow 1; 1 − 0 → 1; 1 − 1 → 0. Subtracting a "1" digit from a "0"
digit produces the digit "1", while 1 has to be subtracted from the
next column. This is known as borrowing. The principle is the same as
for carrying. When 0 − 1 happens, the procedure is to "borrow" the deficit
divided by the radix from the left, subtracting it from the next positional
value. For example,
        1 1 0 1 1 1 0
      −     1 0 1 1 1
      ---------------
      = 1 0 1 0 1 1 1
that is, 110 − 23 = 87.
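The borrowing rule can be sketched in the same column-by-column style. The function name `subtract_binary` is hypothetical, and the sketch assumes the first operand is at least as large as the second so that the result is nonnegative.

```python
def subtract_binary(a: str, b: str) -> str:
    """Compute a - b for binary strings with a >= b, using borrowing."""
    if int(a, 2) < int(b, 2):
        raise ValueError("this sketch requires a >= b")
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    borrow = 0
    out = []
    for x, y in zip(reversed(a), reversed(b)):  # rightmost column first
        diff = int(x) - int(y) - borrow
        if diff < 0:
            diff += 2     # borrow one unit of the next column (worth 2 here)
            borrow = 1
        else:
            borrow = 0
        out.append(str(diff))
    return "".join(reversed(out)).lstrip("0") or "0"


print(subtract_binary("1101110", "10111"))  # 1010111, i.e. 110 - 23 = 87
```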
Multiplication in binary is similar to its decimal counterpart. Two
numbers a and b can be multiplied by partial products: for each digit in b,
the product of that digit and a is calculated and written on a new line,
shifted leftward so that its rightmost digit lines up with the digit in b
that was used. The sum of all these partial products gives the final result.
Since there are only two digits in binary, there are only two possible
outcomes of each partial multiplication: (i) if the digit in b is 0, the
partial product is also 0; (ii) if the digit in b is 1, the partial product
is equal to a.
For example, the binary numbers $(1011)_2$ and $(1010)_2$ are multiplied as
follows:
            1 0 1 1    (= a)
          × 1 0 1 0    (= b)
      -------------
            0 0 0 0
          1 0 1 1
        0 0 0 0
    + 1 0 1 1
      -------------
    = 1 1 0 1 1 1 0
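The partial-product scheme amounts to shift-and-add. A minimal sketch, with the hypothetical name `multiply_binary` and unsigned binary strings as inputs:

```python
def multiply_binary(a: str, b: str) -> str:
    """Multiply two binary strings by summing shifted partial products."""
    a_value = int(a, 2)
    total = 0
    for shift, digit in enumerate(reversed(b)):  # rightmost digit of b first
        if digit == "1":
            # Partial product equals a, shifted left by the digit's position.
            total += a_value << shift
        # A digit 0 contributes a partial product of 0, so nothing is added.
    return format(total, "b")


print(multiply_binary("1011", "1010"))  # 1101110
```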
Binary numbers with digits after the binary point are multiplied in the
same way; the binary point of each partial product is placed according to
the position of the digit of b that produced it.
For example, the multiplication $(101.101)_2 \times (110.01)_2$ is carried out
as follows:
            1 0 1 . 1 0 1
          × 1 1 0 . 0 1
      -----------------------
                1 . 0 1 1 0 1
              0 0 . 0 0 0 0
            0 0 0 . 0 0 0
          1 0 1 1 . 0 1
    +   1 0 1 1 0 . 1
      -----------------------
    = 1 0 0 0 1 1 . 0 0 1 0 1
In decimal form, this is 5.625 × 6.25 = 35.15625.
Binary division is again similar to its decimal counterpart. For example,
compute $(11011)_2 \div (101)_2$. Here, the divisor is $(101)_2$, or 5 in
decimal, while the dividend is $(11011)_2$, or 27 in decimal. The procedure is
the same as that of decimal division; here, the divisor $(101)_2$ goes into
the first three digits 110 of the dividend one time, so a "1" is written on
the top line. This result is multiplied by the divisor, and subtracted from
the first three digits of the dividend; the next digit (a "1") is included
to obtain a new three-digit sequence. The procedure is then repeated with
the new sequence, continuing until the digits in the dividend have been
exhausted:
                1 0 1
            ---------
    1 0 1 ) 1 1 0 1 1
          − 1 0 1
            -----
              0 1 1
            − 0 0 0
              -----
                1 1 1
              − 1 0 1
                -----
                  1 0
Thus $(11011)_2 = (101)_2 \cdot (101)_2 + (10)_2$, so the quotient is $(101)_2$
and the remainder is $(10)_2$; in decimal form, 27 = 5 · 5 + 2.
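The long-division procedure can be sketched as a loop that brings down one digit of the dividend at a time. The function name `divide_binary` is hypothetical; inputs are unsigned binary strings.

```python
def divide_binary(dividend: str, divisor: str) -> tuple[str, str]:
    """Return (quotient, remainder) of binary long division."""
    d = int(divisor, 2)
    if d == 0:
        raise ZeroDivisionError("division by zero")
    remainder = 0
    quotient_bits = []
    for bit in dividend:
        # Bring down the next digit of the dividend.
        remainder = (remainder << 1) | int(bit)
        if remainder >= d:
            remainder -= d              # the divisor goes in once
            quotient_bits.append("1")
        else:
            quotient_bits.append("0")   # the divisor does not go in
    quotient = "".join(quotient_bits).lstrip("0") or "0"
    return quotient, format(remainder, "b")


print(divide_binary("11011", "101"))  # ('101', '10')
```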

As we know, most real numbers have to be rounded off in order to be
represented as t-digit floating point numbers on computers, and the
difference between the floating point number x′ and the original number x
is the round-off error. When arithmetic operations are applied to
floating-point numbers, additional round-off errors may occur.
Definition 0.5 (machine addition of floating point numbers). Machine
addition consists of lining up the binary points of the two numbers to be
added, adding them, and then storing the result again as a floating point
number.
For example, in single precision, adding $2^{-24}$ to 1 would appear as
follows:
$$1.00\ldots0 \times 2^{0} + 1.00\ldots0 \times 2^{-24}$$
      1.00000000000000000000000   × 2^0
    + 0.00000000000000000000000 1 × 2^0
    = 1.00000000000000000000000 1 × 2^0
The 1 contributed by the small number falls in the 24th place after the
binary point, beyond the 23 stored digits. The sum is therefore rounded (the
tie goes to the mantissa whose last stored bit is 0) and saved as
$1.0 \times 2^{0} = 1$. From this example we see that when a very small number
is added to a big number, the result can be the same as the big number.
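Machine addition in single precision can be imitated in Python, whose own floats are double precision. The sketch below is an emulation under stated assumptions: the helper names `to_float32` and `add_float32` are hypothetical, and rounding to single precision is done by a round trip through the standard `struct` module. In IEEE 754 single precision the spacing of numbers just above 1 is 2^−23, so a term of size 2^−24 is the first one that is lost when added to 1.

```python
import struct


def to_float32(x: float) -> float:
    """Round x to the nearest single-precision value (emulated via struct)."""
    return struct.unpack("f", struct.pack("f", x))[0]


def add_float32(a: float, b: float) -> float:
    """Add two numbers and store the result again in single precision."""
    return to_float32(to_float32(a) + to_float32(b))


print(add_float32(1.0, 2.0 ** -24) == 1.0)  # True: the small term is lost
print(add_float32(1.0, 2.0 ** -23) == 1.0)  # False: 1 + 2^-23 is representable
```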
Definition 0.6 (loss of significance). When two nearly equal numbers
are subtracted, significant digits are lost. This phenomenon is called loss
of significance.
For example, we use seven significant digits to do the subtraction
113.4567 − 113.4566:
        1 1 3 . 4 5 6 7
      − 1 1 3 . 4 5 6 6
      -----------------
      = 0 0 0 . 0 0 0 1
The two input numbers have seven-digit accuracy, but after subtraction the
result has only one-digit accuracy. This operation loses many significant
digits. In programming and computation on a computer, loss of significance
should be avoided by restructuring the calculation and reducing operation
counts.
Example 0.8. To avoid loss of significance, rewrite the following
expressions:
(a) $\sqrt{9.01} - 3$; (b) $1 - \cos(0.001)$; (c) $\dfrac{1 - \cos(x)}{\sin^{2}(x)}$.
Solution. In order to avoid loss of significance, we rewrite the
subtraction of two nearly equal numbers in (a), (b) and (c) as follows.
(a) $\sqrt{9.01} - 3 = \dfrac{\sqrt{9.01} - 3}{\sqrt{9.01} + 3}\,(\sqrt{9.01} + 3) = \dfrac{0.01}{\sqrt{9.01} + 3}$;
(b) $1 - \cos(0.001) = 1 - (1 - 2\sin^{2}(0.0005)) = 2\sin^{2}(0.0005)$;
(c) $\dfrac{1 - \cos(x)}{\sin^{2}(x)} = \dfrac{1 - \cos^{2}(x)}{(1 + \cos(x))\sin^{2}(x)} = \dfrac{1}{1 + \cos(x)}$.
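The effect of these rewrites can be observed in ordinary double-precision Python. The sketch below compares the original and rewritten forms; the function names and the test point x = 1e-8 for (c) are choices made for illustration. No expected digits are listed for the naive forms, since they depend on the platform's math library.

```python
import math


def naive_a() -> float:
    return math.sqrt(9.01) - 3                  # subtracts nearly equal numbers


def stable_a() -> float:
    return 0.01 / (math.sqrt(9.01) + 3)         # rewritten form (a)


def naive_b() -> float:
    return 1 - math.cos(0.001)                  # subtracts nearly equal numbers


def stable_b() -> float:
    return 2 * math.sin(0.0005) ** 2            # rewritten form (b)


def naive_c(x: float) -> float:
    return (1 - math.cos(x)) / math.sin(x) ** 2  # numerator cancels for small x


def stable_c(x: float) -> float:
    return 1 / (1 + math.cos(x))                 # rewritten form (c)


print(naive_a(), stable_a())
print(naive_b(), stable_b())
print(naive_c(1e-8), stable_c(1e-8))  # the exact value is very close to 0.5
```

For (c) with a very small x, the numerator 1 − cos(x) loses all of its significant digits in the naive form, while the rewritten form stays close to 1/2.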
Example 0.9. Give the roots of the equation $x^{2} + 9^{12}x - 3 = 0$, paying
attention to loss of significance.
Solution. By the quadratic formula for solutions of quadratic equations,
the solutions $x_1$ and $x_2$ are
$$x_1 = \frac{-9^{12} - \sqrt{9^{24} + 12}}{2}, \quad\text{and}\quad x_2 = \frac{-9^{12} + \sqrt{9^{24} + 12}}{2}.$$
Note that $9^{12}$ and $\sqrt{9^{24} + 12}$ are nearly equal to each other, so
the calculation of $-9^{12} + \sqrt{9^{24} + 12}$ by computers will lead to loss
of significance. Multiplying the numerator and denominator of $x_2$ by
$-9^{12} - \sqrt{9^{24} + 12}$, we have
$$x_2 = \frac{6}{9^{12} + \sqrt{9^{24} + 12}}.$$
Thus the formulas used by computers or developed into programs are
$$x_1 = \frac{-9^{12} - \sqrt{9^{24} + 12}}{2}, \quad\text{and}\quad x_2 = \frac{6}{9^{12} + \sqrt{9^{24} + 12}}. \qquad \Box$$
This example shows that the quadratic formula
$x_{1,2} = (-b \pm \sqrt{b^{2} - 4ac})/(2a)$ for the equation
$ax^{2} + bx + c = 0$ must be used with care in cases where |b| is very close
to $\sqrt{b^{2} - 4ac}$, that is, when a and/or c is very small compared with b.
In this case $b^{2} - 4ac$ is nearly equal to $b^{2}$, and one of the roots in the
traditional expressions is subject to loss of significance. If b is positive
in this situation, the roots should be computed by
$$x_1 = \frac{-b - \sqrt{b^{2} - 4ac}}{2a}, \quad\text{and}\quad x_2 = \frac{-2c}{b + \sqrt{b^{2} - 4ac}};$$
and if b is negative and $b^{2} - 4ac$ is very close to $b^{2}$, the roots are
best computed by
$$x_1 = \frac{-b + \sqrt{b^{2} - 4ac}}{2a}, \quad\text{and}\quad x_2 = \frac{2c}{-b + \sqrt{b^{2} - 4ac}}.$$
If we use the quadratic formula to develop a program to solve a quadratic
equation, the above expressions should be used.
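The two cases can be combined into one routine that never subtracts nearly equal numbers. This is a minimal sketch restricted to real roots; the function name `quadratic_roots` and its error handling are assumptions of the sketch.

```python
import math


def quadratic_roots(a: float, b: float, c: float) -> tuple[float, float]:
    """Real roots of a*x^2 + b*x + c = 0, avoiding cancellation in -b +/- sqrt."""
    if a == 0:
        raise ValueError("a must be nonzero")
    disc = b * b - 4 * a * c
    if disc < 0:
        raise ValueError("this sketch handles real roots only")
    root = math.sqrt(disc)
    if b == 0 and root == 0:
        return 0.0, 0.0                      # degenerate case a*x^2 = 0
    if b >= 0:
        x1 = (-b - root) / (2 * a)           # -b and -root have the same sign
        x2 = (-2 * c) / (b + root)           # rewritten second root
    else:
        x1 = (-b + root) / (2 * a)           # -b and +root have the same sign
        x2 = (2 * c) / (-b + root)           # rewritten second root
    return x1, x2


b = 9.0 ** 12
x1, x2 = quadratic_roots(1.0, b, -3.0)
naive_x2 = (-b + math.sqrt(b * b + 12)) / 2  # traditional formula, for comparison
print(x1, x2, naive_x2)
```

For the equation of Example 0.9, the rewritten formula returns a small positive root near 3/9^12, while the traditional formula for the same root suffers from cancellation.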
Example 0.10. Consider the number of additions and multiplications
required to evaluate a polynomial p(x) at x = 0.5 by the standard form
$p(x) = c_0 + c_1 x + c_2 x^{2} + c_3 x^{3}$ and by the nested form
$p(x) = c_0 + x[c_1 + x(c_2 + c_3 x)]$ (called nested multiplication).
Solution. Counting the operations in the calculation
$p(0.5) = c_0 + c_1 \cdot 0.5 + c_2 \cdot 0.5^{2} + c_3 \cdot 0.5^{3}$, with each
power formed by repeated multiplication, we need 3 additions and 6
multiplications, while if we use
$p(0.5) = c_0 + 0.5 \cdot [c_1 + 0.5 \cdot (c_2 + c_3 \cdot 0.5)]$ and evaluate
from the inside out, 3 additions and 3 multiplications are required. □
This example shows that when a polynomial, especially one of high degree,
is evaluated, the nested form saves multiplications; fewer operations also
mean fewer opportunities for round-off error to accumulate.
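The operation counts can be checked with a small sketch that evaluates a polynomial both ways and tallies the additions and multiplications. The function names and the sample coefficients are hypothetical; the coefficient list is ordered as c0, c1, c2, ....

```python
def eval_standard(coeffs: list[float], x: float) -> tuple[float, int, int]:
    """Evaluate c0 + c1*x + c2*x^2 + ..., forming each power by repeated products."""
    total = coeffs[0]
    adds = mults = 0
    for k in range(1, len(coeffs)):
        term = coeffs[k]
        for _ in range(k):        # k multiplications give c_k * x^k
            term *= x
            mults += 1
        total += term
        adds += 1
    return total, adds, mults


def eval_nested(coeffs: list[float], x: float) -> tuple[float, int, int]:
    """Evaluate the same polynomial by nested multiplication, inside out."""
    total = coeffs[-1]
    adds = mults = 0
    for c in reversed(coeffs[:-1]):
        total = c + x * total     # one multiplication and one addition per step
        adds += 1
        mults += 1
    return total, adds, mults


coeffs = [1.0, 2.0, 3.0, 4.0]       # sample values for c0, c1, c2, c3
print(eval_standard(coeffs, 0.5))   # (3.25, 3, 6)
print(eval_nested(coeffs, 0.5))     # (3.25, 3, 3)
```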
Exercise
1. Find the binary representation of the decimal numbers:
(a) 8; (b) 64; (c) 17; (d) 1/8; (e) 35/16.
2. Convert the following binary numbers to decimal numbers:
(a) $(10111)_2$; (b) $(0.1001)_2$; (c) $(1101.101)_2$.
3. Convert the following base 10 numbers to binary and express each as
a floating point number in single precision by using the Rounding to Nearest
Rule:
(a) 37; (b) 1/4; (c) 1/3; (d) 9.5; (e) 25.75.
4. (a) Suppose $x = (0.1001)_2$. Convert this binary number to decimal;
(b) use the result in (a) to convert $(1.11001)_2$ to a decimal number.
5. Identify for which values of x there is subtraction of nearly equal
numbers, and find an alternate form that avoids the problem.
(a) $\dfrac{1 - \sec x}{\tan^{2} x}$; (b) $\dfrac{1 - (1 - x)^{3}}{x}$; (c) $\dfrac{1}{1 + x} - \dfrac{1}{1 - x}$.
6. Explain how to most accurately compute the two roots of the equation
$x^{2} + bx - 10^{-12} = 0$, where b is a number greater than 100.
