
REGRESSION AND CORRELATION

Regression

A major objective of many statistical investigations is to establish relationships that make
it possible to predict one or more variables in terms of others. Thus studies are made to
predict the potential sales of a new product in terms of the money spent on advertising,
a patient’s weight in terms of the number of weeks he/she has been on a diet, the marks
obtained by a student in terms of the number of classes he/she attended, etc.

Although it is desirable to predict one quantity exactly in terms of the others, this is
seldom possible, and in most cases we have to be satisfied with predicting average or
expected values. Thus we would like to predict the average sales in terms of the money
spent on advertising, or the average income of a college graduate in terms of the number
of years he/she has been out of college.

Thus given two random variables X and Y, and given that X takes the value x, the basic
problem of bivariate regression is to determine the conditional expected value E(Y|x) as a
function of x. In most cases, we may find that E(Y|x) is a linear function of x:

E(Y|x) = α + βx, where the constants α , β are called the regression coefficients.

Denoting E(X) = µ1, E(Y) = µ2, Var(X) = σ1², Var(Y) = σ2², cov(X, Y) = σ12, and
ρ = σ12 / (σ1 σ2), we can show:

Theorem: (a) If the regression of Y on X is linear, then

$$E(Y|x) = \mu_2 + \rho \frac{\sigma_2}{\sigma_1} (x - \mu_1)$$

(b) If the regression of X on Y is linear, then

$$E(X|y) = \mu_1 + \rho \frac{\sigma_1}{\sigma_2} (y - \mu_2)$$

Note: ρ is called the correlation coefficient between X and Y.
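
A short derivation sketch of part (a) may help. It assumes the linear form E(Y|x) = α + βx, finite second moments, and σ1, σ2 denoting the standard deviations of X and Y (so that Var(X) = σ1² and ρ = σ12 / (σ1 σ2)); part (b) follows by exchanging the roles of X and Y.

```latex
\begin{align*}
E(Y) &= E[E(Y \mid X)] = \alpha + \beta \mu_1
  \quad\Rightarrow\quad \alpha = \mu_2 - \beta \mu_1, \\
E(XY) &= E[X \, E(Y \mid X)] = \alpha \mu_1 + \beta E(X^2)
  = \alpha \mu_1 + \beta (\sigma_1^2 + \mu_1^2), \\
\sigma_{12} &= E(XY) - \mu_1 \mu_2
  = \alpha \mu_1 + \beta (\sigma_1^2 + \mu_1^2) - \mu_1 (\alpha + \beta \mu_1)
  = \beta \sigma_1^2, \\
\beta &= \frac{\sigma_{12}}{\sigma_1^2} = \rho \frac{\sigma_2}{\sigma_1},
\qquad
E(Y \mid x) = \alpha + \beta x = \mu_2 + \rho \frac{\sigma_2}{\sigma_1} (x - \mu_1).
\end{align*}
```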

In actual situations, we have to “estimate” the regression coefficients α, β from a random
sample {(x1, y1), (x2, y2), …, (xn, yn)} of size n from the 2-dimensional random variable
(X, Y). We now “fit” a straight line y = a + bx to the above data by the method of “least
squares”. The method of least squares says: choose the constants a and b for which the
sum of the squares of the “vertical deviations” of the sample points (xi, yi) from the line
y = a + bx is a minimum. I.e., find a, b so that

$$T = \sum_{i=1}^{n} [y_i - (a + b x_i)]^2$$

is a minimum. Using 2-variable calculus, we should determine a, b so that ∂T/∂a = 0 and
∂T/∂b = 0. Thus we get the following two equations:

$$\sum_{i=1}^{n} (-2)\,[y_i - (a + b x_i)] = 0 \quad \text{and} \quad \sum_{i=1}^{n} (-2 x_i)\,[y_i - (a + b x_i)] = 0.$$

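Simplifying, we get the so-called “normal equations”:

$$\begin{aligned}
n a + \Big( \sum_{i=1}^{n} x_i \Big) b &= \sum_{i=1}^{n} y_i \\
\Big( \sum_{i=1}^{n} x_i \Big) a + \Big( \sum_{i=1}^{n} x_i^2 \Big) b &= \sum_{i=1}^{n} x_i y_i
\end{aligned}$$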

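Solving, we get

$$b = \frac{n \sum_{i=1}^{n} x_i y_i - \big( \sum_{i=1}^{n} x_i \big) \big( \sum_{i=1}^{n} y_i \big)}{n \sum_{i=1}^{n} x_i^2 - \big( \sum_{i=1}^{n} x_i \big)^2},
\qquad
a = \frac{\sum_{i=1}^{n} y_i - \big( \sum_{i=1}^{n} x_i \big) b}{n}.$$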

These constants a and b are used to estimate the unknown regression coefficients α , β.
Now if x = xg, we predict y as yg = a + bxg.
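
The closed-form solution of the normal equations can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `least_squares_line` and the four data points are hypothetical, chosen to lie exactly on y = 1 + 2x so the fit is easy to check by hand.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Denominator of b: n * sum(x^2) - (sum x)^2; zero when all x are equal.
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    b = (n * sum_xy - sum_x * sum_y) / denom
    a = (sum_y - sum_x * b) / n
    return a, b


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
a, b = least_squares_line(xs, ys)

# Prediction at a new point x_g, as in y_g = a + b * x_g.
x_g = 10
y_g = a + b * x_g
print(a, b, y_g)  # 1.0 2.0 21.0
```

The guard on the denominator reflects the fact that a slope cannot be estimated when every x value is the same.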

Problem 1.

Various doses of a poisonous substance were given to groups of 25 mice and the
following results were observed:

Dose (mg)    Number of deaths
    x               y
    4               1
    6               3
    8               6
   10               8
   12              14
   14              16
   16              20

(a) Find the equation of the least squares line fit to these data
(b) Estimate the number of deaths in a group of 25 mice who receive a 7 mg dose of
this poison.

Solution:

(a) n = number of sample pairs (xi, yi) = 7

Σxi = 70, Σyi = 68, Σxi² = 812, Σxi yi = 862

Hence b = {7 × 862 − 70 × 68} / {7 × 812 − (70)²} = 1274/784 = 1.625

a = {68 − 70 × 1.625}/7 = −6.536

Thus the least squares line that fits the given data is: y = −6.536 + 1.625x

(b) If x = 7, y = −6.536 + 1.625 × 7 = 4.839.
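
The arithmetic in this solution can be reproduced with a short script. The sketch below only restates the formulas for b and a on the Problem 1 data; the variable names are illustrative.

```python
# Data from Problem 1: dose in mg (x) and number of deaths out of 25 mice (y).
doses = [4, 6, 8, 10, 12, 14, 16]
deaths = [1, 3, 6, 8, 14, 16, 20]

n = len(doses)
sum_x = sum(doses)
sum_y = sum(deaths)
sum_xx = sum(x * x for x in doses)
sum_xy = sum(x * y for x, y in zip(doses, deaths))

# Slope and intercept from the solved normal equations.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
a = (sum_y - sum_x * b) / n

# Predicted number of deaths for a 7 mg dose.
pred_7mg = a + b * 7

print(n, sum_x, sum_y, sum_xx, sum_xy)  # 7 70 68 812 862
print(round(b, 3), round(a, 3))         # 1.625 -6.536
print(round(pred_7mg, 3))               # 4.839
```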

Problem 2:

The following are the scores that 12 students obtained in the midterm and final
examinations in a course in Statistics:

Midterm Examination    Final Examination
         x                     y
        71                    83
        49                    62
        80                    76
        73                    77
        93                    89
        85                    74
        58                    48
        82                    78
        64                    76
        32                    51
        87                    73
        80                    89

(a) Fit a straight line to the above data
(b) Hence predict the final exam score of a student who received a score of 84 in the
midterm examination.

Solution:

(a) n = number of sample pairs (xi, yi) = 12

Σxi = 854, Σyi = 876, Σxi² = 64222, Σxi yi = 64346

Hence b = {12 × 64346 − 854 × 876} / {12 × 64222 − (854)²} = 24048/41348 = 0.5816

a = {876 − 854 × 0.5816}/12 = 31.609

Thus the least squares line that fits the given data is: y = 31.609 + 0.5816x

(b) If x = 84, y = 31.609 + 0.5816 × 84 = 80.46
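
As a cross-check on Problem 2, the normal equations can also be solved directly as a 2×2 linear system. The sketch below uses Cramer's rule, which is a choice made here for illustration and not the route taken in the solution above; the variable names are hypothetical.

```python
# Data from Problem 2: midterm scores (x) and final exam scores (y).
midterm = [71, 49, 80, 73, 93, 85, 58, 82, 64, 32, 87, 80]
final = [83, 62, 76, 77, 89, 74, 48, 78, 76, 51, 73, 89]

n = len(midterm)
sx = sum(midterm)
sy = sum(final)
sxx = sum(x * x for x in midterm)
sxy = sum(x * y for x, y in zip(midterm, final))

# Normal equations written as a 2x2 linear system in (a, b):
#   n  * a + sx  * b = sy
#   sx * a + sxx * b = sxy
det = n * sxx - sx * sx          # determinant of the coefficient matrix
a = (sy * sxx - sx * sxy) / det  # Cramer's rule for the intercept
b = (n * sxy - sx * sy) / det    # Cramer's rule for the slope

# Predicted final score for a midterm score of 84.
pred_84 = a + b * 84

print(sx, sy, sxx, sxy)          # 854 876 64222 64346
print(round(b, 4), round(a, 3))  # 0.5816 31.609
print(round(pred_84, 2))         # 80.46
```

Solving for a through the determinant avoids reusing a rounded value of b, which is why the intercept agrees with the hand calculation only to the digits shown.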

Correlation

If X, Y are two random variables, the correlation coefficient, ρ, between X and Y is
defined as

$$\rho = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}.$$

It can be shown that

(a) -1 ≤ ρ ≤ 1
(b) If Y is a non-constant linear function of X (Y = a + bX with b ≠ 0), then ρ = ± 1
(c) If X and Y are independent, then ρ = 0
(d) If X, Y have bivariate normal distribution and if ρ = 0, then X and Y are
independent.
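
Property (d) needs the bivariate normal assumption because the converse of (c) fails in general. A minimal sketch with a hypothetical discrete example (X uniform on {-1, 0, 1} and Y = X², not taken from the text) shows zero covariance, and hence ρ = 0, without independence; exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Hypothetical example: X is uniform on {-1, 0, 1} and Y = X**2.
support = [-1, 0, 1]
p = Fraction(1, 3)  # P(X = x) for each x in the support

e_x = sum(p * x for x in support)            # E(X)
e_y = sum(p * x ** 2 for x in support)       # E(Y) = E(X^2)
e_xy = sum(p * x * x ** 2 for x in support)  # E(XY) = E(X^3)
cov_xy = e_xy - e_x * e_y

# Independence would require P(X=0, Y=0) == P(X=0) * P(Y=0).
p_joint = p        # Y = 0 exactly when X = 0, so P(X=0, Y=0) = 1/3
p_product = p * p  # P(X=0) * P(Y=0) = 1/3 * 1/3 = 1/9

print(cov_xy)                # 0
print(p_joint == p_product)  # False
```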

Sample Correlation Coefficient

If { (x1,y1), (x2, y2), … (xn, yn)} is a random sample of size n from the 2-dimensional
random variable (X, Y), then the sample correlation coefficient, r, is defined by

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}.$$

We shall use r to estimate the (unknown) population correlation coefficient ρ. If (X, Y)
has a bivariate normal distribution, we can show that the random variable

$$Z = \frac{1}{2} \ln \frac{1 + r}{1 - r}$$

is approximately normal with mean $\frac{1}{2} \ln \frac{1 + \rho}{1 - \rho}$ and variance $\frac{1}{n - 3}$.
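
One common use of this approximation is an approximate confidence interval for ρ. The text does not work this out, so the sketch below is an illustration only: the function name `fisher_z_interval`, the default critical value 1.96 (the usual two-sided 95% normal value) and the inputs r = 0.6, n = 28 are all assumptions.

```python
import math


def fisher_z_interval(r, n, z_crit=1.96):
    """Approximate confidence interval for rho from a sample correlation r."""
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must exceed 3 for the variance 1/(n - 3)")
    z = 0.5 * math.log((1 + r) / (1 - r))  # Z = (1/2) ln((1 + r)/(1 - r))
    se = 1 / math.sqrt(n - 3)              # standard deviation of Z
    z_low = z - z_crit * se
    z_high = z + z_crit * se
    # Invert the transform: if z = (1/2) ln((1 + rho)/(1 - rho)), rho = tanh(z).
    return math.tanh(z_low), math.tanh(z_high)


# Hypothetical values of r and n, chosen only for illustration.
low, high = fisher_z_interval(0.6, 28)
print(low < 0.6 < high)  # True
```

The interval is symmetric on the Z scale but not on the r scale, which is the reason for working with Z rather than with r directly.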
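Note: A computational formula for r is given by

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}},$$

where

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\big( \sum_{i=1}^{n} x_i \big)^2}{n},
\qquad
S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\big( \sum_{i=1}^{n} y_i \big)^2}{n},$$

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\big( \sum_{i=1}^{n} x_i \big) \big( \sum_{i=1}^{n} y_i \big)}{n}.$$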
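
The definition of r and the computational formula should give the same value. The sketch below checks this numerically; the function names and the five data points are hypothetical and serve only to illustrate the two routes.

```python
import math


def corr_definition(xs, ys):
    """Sample correlation from deviations about the sample means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


def corr_computational(xs, ys):
    """Sample correlation from raw sums, using S_xx, S_yy and S_xy."""
    n = len(xs)
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data, small enough to verify by hand.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
r_def = corr_definition(xs, ys)
r_comp = corr_computational(xs, ys)
print(round(r_def, 4), round(r_comp, 4))  # 0.8 0.8
```

The raw-sum form needs only one pass over the data, but it subtracts two large numbers and can lose precision in floating point when the values are large relative to their spread.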

Problem 3.

Calculate r for the data { (8, 3), (1, 4), (5, 0), (4, 2), (7, 1) }.

Solution

$\bar{x} = 25/5 = 5$, $\bar{y} = 10/5 = 2$.

$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = 3 \times 1 + (-4) \times 2 + 0 \times (-2) + (-1) \times 0 + 2 \times (-1) = -7$$

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = 9 + 16 + 0 + 1 + 4 = 30$$

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = 1 + 4 + 4 + 0 + 1 = 10$$

Hence

$$r = \frac{-7}{\sqrt{(30)(10)}} = -0.404.$$

Problem 4.

The following are the measurements of the air velocity and evaporation coefficient of
burning fuel droplets in an impulse engine:

Air velocity    Evaporation coefficient
     x                    y
     20                  0.18
     60                  0.37
    100                  0.35
    140                  0.78
    180                  0.56
    220                  0.75
    260                  1.18
    300                  1.36
    340                  1.17
    380                  1.65

Find the sample correlation coefficient, r.

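Solution.

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\big( \sum_{i=1}^{n} x_i \big)^2}{n} = 532000 - \frac{(2000)^2}{10} = 132000$$

$$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\big( \sum_{i=1}^{n} y_i \big)^2}{n} = 9.1097 - \frac{(8.35)^2}{10} = 2.13745$$

$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\big( \sum_{i=1}^{n} x_i \big) \big( \sum_{i=1}^{n} y_i \big)}{n} = 2175.4 - \frac{(2000)(8.35)}{10} = 505.4$$

Hence

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{505.4}{\sqrt{(132000)(2.13745)}} = 0.9515.$$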
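
The Problem 4 figures can be reproduced with the computational formula. In the sketch below the variable names are illustrative, and the y value at x = 300 is taken as 1.36, the value consistent with the sums Σy = 8.35, Σy² = 9.1097 and Σxy = 2175.4 quoted in the solution.

```python
import math

# Data from Problem 4: air velocity (x) and evaporation coefficient (y).
velocity = [20, 60, 100, 140, 180, 220, 260, 300, 340, 380]
# Assumption: y = 1.36 at x = 300, as implied by the sums used in the solution.
evap = [0.18, 0.37, 0.35, 0.78, 0.56, 0.75, 1.18, 1.36, 1.17, 1.65]

n = len(velocity)
s_xx = sum(x * x for x in velocity) - sum(velocity) ** 2 / n
s_yy = sum(y * y for y in evap) - sum(evap) ** 2 / n
s_xy = sum(x * y for x, y in zip(velocity, evap)) - sum(velocity) * sum(evap) / n

r = s_xy / math.sqrt(s_xx * s_yy)

print(round(s_xx), round(s_yy, 5), round(s_xy, 1))  # 132000 2.13745 505.4
print(round(r, 4))                                  # 0.9515
```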

**************
