Probe

REGRESSION AND CORRELATION
Regression
A major objective of many statistical investigations is to establish relationships that make
it possible to predict one or more variables in terms of others. Thus studies
are made to predict the potential sales of a new product in terms of the money spent on
advertising, a patient's weight in terms of the number of weeks he/she has been on a
diet, the marks obtained by a student in terms of the number of classes he/she attended, etc.
Although it is desirable to predict one quantity exactly in terms of the others, this is
seldom possible, and in most cases we have to be satisfied with predicting average or
expected values. Thus we would like to predict the average sales in terms of the money
spent on advertising, or the average income of a college graduate in terms of the number of
years he/she has been out of college.
Thus, given two random variables X and Y, and given that X takes the value x, the basic
problem of bivariate regression is to determine the conditional expected value E(Y|x) as a
function of x. In many cases, we may find that E(Y|x) is a linear function of x:
E(Y|x) = α + βx, where the constants α, β are called the regression coefficients.
Denoting E(X) = µ1, E(Y) = µ2, Var(X) = σ1², Var(Y) = σ2², cov(X, Y) = σ12, and
$$\rho = \frac{\sigma_{12}}{\sigma_1 \sigma_2},$$
we can show:
Theorem: (a) If the regression of Y on X is linear, then
$$E(Y \mid x) = \mu_2 + \rho \, \frac{\sigma_2}{\sigma_1} \, (x - \mu_1)$$
(b) If the regression of X on Y is linear, then
$$E(X \mid y) = \mu_1 + \rho \, \frac{\sigma_1}{\sigma_2} \, (y - \mu_2)$$
Note: ρ is called the correlation coefficient between X and Y.
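The theorem is stated without proof. A short sketch of why part (a) holds is given below; it is an added step, not taken from the text, and it assumes that σ1 and σ2 denote the standard deviations of X and Y and that the regression is E(Y|x) = α + βx. It uses only the law of total expectation. Part (b) follows by exchanging the roles of X and Y.

```latex
\begin{align*}
E(Y) &= E[\,E(Y \mid X)\,] = \alpha + \beta \mu_1
  \quad\Rightarrow\quad \alpha = \mu_2 - \beta \mu_1, \\
E(XY) &= E[\,X \, E(Y \mid X)\,] = \alpha \mu_1 + \beta E(X^2)
       = \alpha \mu_1 + \beta (\sigma_1^2 + \mu_1^2), \\
\sigma_{12} &= E(XY) - \mu_1 \mu_2 = \beta \sigma_1^2
  \quad\Rightarrow\quad \beta = \frac{\sigma_{12}}{\sigma_1^2}
                              = \rho \, \frac{\sigma_2}{\sigma_1}, \\
E(Y \mid x) &= \alpha + \beta x
             = \mu_2 + \rho \, \frac{\sigma_2}{\sigma_1} \, (x - \mu_1).
\end{align*}
```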
In actual situations, we have to "estimate" the regression coefficients α, β from a random
sample {(x1, y1), (x2, y2), …, (xn, yn)} of size n from the 2-dimensional random variable
(X, Y). We now "fit" a straight line y = a + bx to these data by the method of "least
squares". The method of least squares says: choose the constants a and b for which the
sum of the squares of the "vertical deviations" of the sample points (xi, yi) from the line
y = a + bx is a minimum. That is, find a, b so that
$$T = \sum_{i=1}^{n} [\, y_i - (a + b x_i) \,]^2$$
is a minimum. Using 2-variable calculus, we should determine a, b so that
$$\frac{\partial T}{\partial a} = 0 \quad \text{and} \quad \frac{\partial T}{\partial b} = 0.$$
Thus we get the following two equations:
$$\sum_{i=1}^{n} (-2) [\, y_i - (a + b x_i) \,] = 0 \quad \text{and} \quad \sum_{i=1}^{n} (-2 x_i) [\, y_i - (a + b x_i) \,] = 0.$$
Simplifying, we get the so-called "normal equations":
$$n a + \Big( \sum_{i=1}^{n} x_i \Big) b = \sum_{i=1}^{n} y_i$$
$$\Big( \sum_{i=1}^{n} x_i \Big) a + \Big( \sum_{i=1}^{n} x_i^2 \Big) b = \sum_{i=1}^{n} x_i y_i$$
Solving, we get
$$b = \frac{n \sum_{i=1}^{n} x_i y_i - \big( \sum_{i=1}^{n} x_i \big) \big( \sum_{i=1}^{n} y_i \big)}{n \sum_{i=1}^{n} x_i^2 - \big( \sum_{i=1}^{n} x_i \big)^2}, \qquad a = \frac{\sum_{i=1}^{n} y_i - \big( \sum_{i=1}^{n} x_i \big) b}{n}.$$
These constants a and b are used to estimate the unknown regression coefficients α , β.
Now if x = xg, we predict y as yg = a + bxg.
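The closed-form solution of the normal equations translates directly into a few lines of code. The following is a minimal sketch in plain Python; the function names `fit_line` and `predict` and the small data set are illustrative assumptions, not part of the text.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x.

    Uses the closed-form solution of the normal equations.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    b = (n * sum_xy - sum_x * sum_y) / denom
    a = (sum_y - sum_x * b) / n
    return a, b


def predict(a, b, x_g):
    """Predicted value y_g = a + b*x_g on the fitted line."""
    return a + b * x_g


# Hypothetical data lying exactly on y = 1 + 2x, so the fit should recover it.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b, predict(a, b, 10))
```

The guard on `denom` covers the one case in which the normal equations have no unique solution, namely when every x in the sample is the same.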
Problem 1.
Various doses of a poisonous substance were given to groups of 25 mice and the
following results were observed:
Dose (mg), x    Number of deaths, y
      4                 1
      6                 3
      8                 6
     10                 8
     12                14
     14                16
     16                20
(a) Find the equation of the least squares line fitted to these data.
(b) Estimate the number of deaths in a group of 25 mice who receive a 7 mg dose of
this poison.
Solution:
(a) n = number of sample pairs (xi, yi) = 7
Σxi = 70, Σyi = 68, Σxi² = 812, Σxi yi = 862
Hence b = {7 × 862 − 70 × 68} / {7 × 812 − (70)²} = 1274/784 = 1.625
a = {68 − 70 × 1.625}/7 = −6.536
Thus the least squares line that fits the given data is: y = −6.536 + 1.625x
(b) If x = 7, y = −6.536 + 1.625 × 7 = 4.839.
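The arithmetic in this solution can be checked by machine. The sketch below recomputes the sums from the table and uses exact fractions so that rounding plays no part until the last step; the variable names are illustrative assumptions.

```python
from fractions import Fraction

# Data from Problem 1: dose in mg (x) and number of deaths (y).
doses = [4, 6, 8, 10, 12, 14, 16]
deaths = [1, 3, 6, 8, 14, 16, 20]

n = len(doses)
sum_x = sum(doses)
sum_y = sum(deaths)
sum_xx = sum(x * x for x in doses)
sum_xy = sum(x * y for x, y in zip(doses, deaths))

# Slope and intercept from the normal equations, kept as exact fractions.
b = Fraction(n * sum_xy - sum_x * sum_y, n * sum_xx - sum_x ** 2)
a = (sum_y - sum_x * b) / n
y_at_7 = a + b * 7

print(sum_x, sum_y, sum_xx, sum_xy)
print(b, a, y_at_7)
print(float(b), round(float(a), 3), round(float(y_at_7), 3))
```

In exact form the slope is 13/8 and the intercept is −183/28, which round to the values 1.625 and −6.536 used in the solution.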
Problem 2:
The following are the scores that 12 students obtained in the midterm and final
examinations in a course in Statistics:
Midterm examination, x    Final examination, y
          71                       83
          49                       62
          80                       76
          73                       77
          93                       89
          85                       74
          58                       48
          82                       78
          64                       76
          32                       51
          87                       73
          80                       89
(a) Fit a straight line to the above data.
(b) Hence predict the final exam score of a student who received a score of 84 in the
midterm examination.
Solution:
(a) n = number of sample pairs (xi, yi) = 12
Σxi = 854, Σyi = 876, Σxi² = 64222, Σxi yi = 64346
Hence b = {12 × 64346 − 854 × 876} / {12 × 64222 − (854)²} = 24048/41348 = 0.5816
a = {876 − 854 × 0.5816}/12 = 31.609
Thus the least squares line that fits the given data is: y = 31.609 + 0.5816x
(b) If x = 84, y = 31.609 + 0.5816 × 84 = 80.46
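As a cross-check, the same line can be obtained from sums of deviations about the means. Dividing the numerator and denominator of b by n gives b = Sxy/Sxx, with Sxy = Σ(xi − x̄)(yi − ȳ) and Sxx = Σ(xi − x̄)², and the first normal equation gives a = ȳ − b x̄. The sketch below assumes this equivalent form; the variable names are illustrative.

```python
# Data from Problem 2: midterm scores (x) and final scores (y).
midterm = [71, 49, 80, 73, 93, 85, 58, 82, 64, 32, 87, 80]
final = [83, 62, 76, 77, 89, 74, 48, 78, 76, 51, 73, 89]

n = len(midterm)
x_bar = sum(midterm) / n
y_bar = sum(final) / n

# Sums of squares and cross-products about the means.
s_xx = sum((x - x_bar) ** 2 for x in midterm)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(midterm, final))

b = s_xy / s_xx          # slope
a = y_bar - b * x_bar    # intercept
predicted = a + b * 84   # predicted final score for a midterm score of 84

print(round(b, 4), round(a, 3), round(predicted, 2))
```

Working with deviations keeps the intermediate numbers small, which tends to lose less precision in floating point than the raw-sum formula when the data values are large.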
Correlation
If X, Y are two random variables, the correlation coefficient, ρ, between X and Y is
defined as
$$\rho = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{Var}(X) \, \operatorname{Var}(Y)}}.$$
It can be shown that:
(a) −1 ≤ ρ ≤ 1
(b) If Y is a linear function of X, say Y = a + bX with b ≠ 0, then ρ = ±1 (the sign is that of b)
(c) If X and Y are independent, then ρ = 0
(d) If X, Y have a bivariate normal distribution and if ρ = 0, then X and Y are
independent.
Sample Correlation Coefficient
If {(x1, y1), (x2, y2), …, (xn, yn)} is a random sample of size n from the 2-dimensional
random variable (X, Y), then the sample correlation coefficient, r, is defined by
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}.$$
We shall use r to estimate the (unknown) population correlation coefficient ρ. If (X, Y)
has a bivariate normal distribution, we can show that the random variable
$$Z = \frac{1}{2} \ln \frac{1 + r}{1 - r}$$
is approximately normal with mean $\frac{1}{2} \ln \frac{1 + \rho}{1 - \rho}$ and variance $\frac{1}{n - 3}$.
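One common use of this approximation, not worked out in the text, is an approximate interval estimate for ρ: build a normal interval for the mean of Z and map its endpoints back with ρ = tanh(Z), the inverse of the transformation. The sketch below assumes that approach; the function names, the critical value 1.96 (for roughly 95% coverage), and the sample values r = 0.5, n = 28 are all illustrative assumptions.

```python
import math


def fisher_z(r):
    """Z = (1/2) ln((1 + r) / (1 - r)), defined for -1 < r < 1."""
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return 0.5 * math.log((1 + r) / (1 - r))


def approx_interval_for_rho(r, n, z_crit=1.96):
    """Approximate interval for rho based on Z ~ N(mean, 1/(n - 3))."""
    if n <= 3:
        raise ValueError("the variance 1/(n - 3) requires n > 3")
    z = fisher_z(r)
    half_width = z_crit / math.sqrt(n - 3)
    # tanh inverts the transformation, taking Z-scale endpoints back to rho.
    return math.tanh(z - half_width), math.tanh(z + half_width)


# Hypothetical sample: r = 0.5 observed from n = 28 pairs.
low, high = approx_interval_for_rho(0.5, 28)
print(fisher_z(0.5), low, high)
```

The interval is not symmetric about r, because the transformation stretches values near ±1 more than values near 0.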
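Note: A computational formula for r is given by
$$r = \frac{S_{xy}}{\sqrt{S_{xx} \, S_{yy}}},$$
where
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\big( \sum_{i=1}^{n} x_i \big)^2}{n}, \qquad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\big( \sum_{i=1}^{n} y_i \big)^2}{n},$$
$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\big( \sum_{i=1}^{n} x_i \big) \big( \sum_{i=1}^{n} y_i \big)}{n}.$$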
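The computational formula needs only one pass over the data. A minimal sketch follows; the function name `sample_correlation` and the test data are illustrative assumptions. The two calls use points lying exactly on a line, for which r should be +1 or −1 according to the sign of the slope.

```python
import math


def sample_correlation(xs, ys):
    """Sample correlation r = S_xy / sqrt(S_xx * S_yy), from raw sums."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sum_x = sum(xs)
    sum_y = sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when all x or all y values are equal")
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data on y = 1 + 2x (increasing) and on y = 11 - 2x (decreasing).
r_up = sample_correlation([1, 2, 3, 4], [3, 5, 7, 9])
r_down = sample_correlation([1, 2, 3, 4], [9, 7, 5, 3])
print(r_up, r_down)
```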
Problem 3.
Calculate r for the data { (8, 3), (1, 4), (5, 0), (4, 2), (7, 1) }.
Solution
x̄ = 25/5 = 5, ȳ = 10/5 = 2.
$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = 3 \times 1 + (-4) \times 2 + 0 \times (-2) + (-1) \times 0 + 2 \times (-1) = -7$$
$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = 9 + 16 + 0 + 1 + 4 = 30$$
$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = 1 + 4 + 4 + 0 + 1 = 10$$
Hence
$$r = \frac{-7}{\sqrt{(30)(10)}} = -0.404.$$
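The same computation, written out step by step with the deviations from the means kept in lists, can be sketched as follows. The variable names are illustrative assumptions.

```python
import math

# Data from Problem 3.
data = [(8, 3), (1, 4), (5, 0), (4, 2), (7, 1)]

n = len(data)
x_bar = sum(x for x, _ in data) / n
y_bar = sum(y for _, y in data) / n

# Deviations of each observation from the sample means.
dx = [x - x_bar for x, _ in data]
dy = [y - y_bar for _, y in data]

s_xy = sum(u * v for u, v in zip(dx, dy))
s_xx = sum(u * u for u in dx)
s_yy = sum(v * v for v in dy)
r = s_xy / math.sqrt(s_xx * s_yy)

print(dx, dy)
print(s_xy, s_xx, s_yy, round(r, 3))
```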
Problem 4.
The following are the measurements of the air velocity and evaporation coefficient of
burning fuel droplets in an impulse engine:
Air velocity, x    Evaporation coefficient, y
      20                  0.18
      60                  0.37
     100                  0.35
     140                  0.78
     180                  0.56
     220                  0.75
     260                  1.18
     300                  1.36
     340                  1.17
     380                  1.65
Find the sample correlation coefficient, r.
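Solution.
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\big( \sum_{i=1}^{n} x_i \big)^2}{n} = 532000 - \frac{(2000)^2}{10} = 132000$$
$$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\big( \sum_{i=1}^{n} y_i \big)^2}{n} = 9.1097 - \frac{(8.35)^2}{10} = 2.13745$$
$$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\big( \sum_{i=1}^{n} x_i \big) \big( \sum_{i=1}^{n} y_i \big)}{n} = 2175.4 - \frac{(2000)(8.35)}{10} = 505.4$$
Hence
$$r = \frac{S_{xy}}{\sqrt{S_{xx} \, S_{yy}}} = \frac{505.4}{\sqrt{(132000)(2.13745)}} = 0.9515.$$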
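The final steps can be reproduced from the five sums quoted in the solution alone, without going back to the raw data. The sketch below assumes those sums (Σx = 2000, Σx² = 532000, Σy = 8.35, Σy² = 9.1097, Σxy = 2175.4) as given; the variable names are illustrative.

```python
import math

# Summary sums as quoted in the solution to Problem 4.
n = 10
sum_x, sum_xx = 2000, 532000
sum_y, sum_yy = 8.35, 9.1097
sum_xy = 2175.4

# Computational formulas for the sums about the means.
s_xx = sum_xx - sum_x ** 2 / n
s_yy = sum_yy - sum_y ** 2 / n
s_xy = sum_xy - sum_x * sum_y / n

r = s_xy / math.sqrt(s_xx * s_yy)
print(round(s_xx), round(s_yy, 5), round(s_xy, 1), round(r, 4))
```

A value of r this close to 1 indicates a strong positive linear association between air velocity and the evaporation coefficient in this sample.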
**************