Regression
Although it is desirable to predict one quantity exactly in terms of others, this is
seldom possible, and in most cases we have to be satisfied with predicting average or
expected values. Thus we would like to predict the average sales in terms of the money
spent on advertising, or the average income of a college student in terms of the number of
years he/she has been out of college.
Thus, given two random variables X and Y, and given that X takes the value x, the basic
problem of bivariate regression is to determine the conditional expected value E(Y|x) as a
function of x. In most cases, we may find that E(Y|x) is a linear function of x:
E(Y|x) = α + βx, where the constants α, β are called the regression coefficients.
Denoting E(X) = µ1, E(Y) = µ2, Var(X) = σ1², Var(Y) = σ2², cov(X, Y) = σ12, and
ρ = σ12/(σ1σ2), we can show:
The method of least squares chooses the constants a and b for which the sum of the
squares of the “vertical deviations” of the sample points (xi, yi) from the line
y = a + bx is a minimum. That is, find a, b so that

\[ T = \sum_{i=1}^{n} \left[ y_i - (a + b x_i) \right]^2 \]

is a minimum. Using two-variable calculus, we determine a, b so that ∂T/∂a = 0 and
∂T/∂b = 0. Thus we get the following two equations:

\[ \sum_{i=1}^{n} (-2)\left[ y_i - (a + b x_i) \right] = 0 \quad \text{and} \quad \sum_{i=1}^{n} (-2 x_i)\left[ y_i - (a + b x_i) \right] = 0. \]
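Dividing by −2 and rearranging gives the normal equations:

\[
\begin{aligned}
n a + \Big( \sum_{i=1}^{n} x_i \Big) b &= \sum_{i=1}^{n} y_i \\
\Big( \sum_{i=1}^{n} x_i \Big) a + \Big( \sum_{i=1}^{n} x_i^2 \Big) b &= \sum_{i=1}^{n} x_i y_i
\end{aligned}
\]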
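Solving, we get

\[ b = \frac{n \sum_{i=1}^{n} x_i y_i - \Big( \sum_{i=1}^{n} x_i \Big) \Big( \sum_{i=1}^{n} y_i \Big)}{n \sum_{i=1}^{n} x_i^2 - \Big( \sum_{i=1}^{n} x_i \Big)^2}, \qquad a = \frac{\sum_{i=1}^{n} y_i - \Big( \sum_{i=1}^{n} x_i \Big) b}{n}. \]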
These constants a and b are used to estimate the unknown regression coefficients α , β.
Now if x = xg, we predict y as yg = a + bxg.
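
The closed-form expressions for a and b can be sketched in a few lines of Python. This is only an illustration: the function names least_squares_line and predict are hypothetical, and the sample points are made up (they lie exactly on y = 1 + 2x) rather than taken from the problems in these notes.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Denominator of b: n * sum(x^2) - (sum x)^2
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    b = (n * sum_xy - sum_x * sum_y) / denom
    a = (sum_y - sum_x * b) / n
    return a, b


def predict(a, b, x_g):
    """Predicted value y_g = a + b * x_g."""
    return a + b * x_g


# Illustrative data only (not from the text): points on the line y = 1 + 2x.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
a, b = least_squares_line(xs, ys)
print(a, b)              # 1.0 2.0
print(predict(a, b, 4))  # 9.0
```

Because the illustrative points lie exactly on a line, the fit recovers that line; with real data the deviations would not all be zero.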
Problem 1.
Various doses of a poisonous substance were given to groups of 25 mice and the
following results were observed:
(a) Find the equation of the least squares line fit to these data
(b) Estimate the number of deaths in a group of 25 mice who receive a 7 mg dose of
this poison.
Solution:
Thus the least squares line that fits the given data is y = -6.536 + 1.625x.
For part (b), setting x = 7 gives y = -6.536 + 1.625(7) = 4.839, so about 5 deaths are
expected.
Problem 2:
The following are the scores that 12 students obtained in the midterm and final
examinations in a course in Statistics:
(a) Fit a straight line to the above data
(b) Hence predict the final exam score of a student who received a score of 84 in the
midterm examination.
Solution:
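Thus the least squares line that fits the given data is y = 31.609 + 0.5816x.
For part (b), setting x = 84 gives y = 31.609 + 0.5816(84) = 80.46, so the predicted
final examination score is about 80.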
Correlation
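The correlation coefficient ρ of X and Y is defined by

\[ \rho = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\,\operatorname{Var}(Y)}}. \]

It has the following properties: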
(a) -1 ≤ ρ ≤ 1
(b) If Y is a linear function of X, ρ = ± 1
(c) If X and Y are independent, then ρ = 0
(d) If X, Y have bivariate normal distribution and if ρ = 0, then X and Y are
independent.
If { (x1,y1), (x2, y2), … (xn, yn)} is a random sample of size n from the 2-dimensional
random variable (X, Y), then the sample correlation coefficient, r, is defined by
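\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}, \]

where

\[ S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\big( \sum_{i=1}^{n} x_i \big)^2}{n}, \qquad S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\big( \sum_{i=1}^{n} y_i \big)^2}{n}, \]

\[ S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\big( \sum_{i=1}^{n} x_i \big) \big( \sum_{i=1}^{n} y_i \big)}{n}. \]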
Problem 3.
Calculate r for the data { (8, 3), (1, 4), (5, 0), (4, 2), (7, 1) }.
Solution
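\[ \bar{x} = 25/5 = 5, \qquad \bar{y} = 10/5 = 2. \]

\[ \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = 3 \times 1 + (-4) \times 2 + 0 \times (-2) + (-1) \times 0 + 2 \times (-1) = -7 \]

\[ \sum_{i=1}^{n} (x_i - \bar{x})^2 = 9 + 16 + 0 + 1 + 4 = 30 \]

\[ \sum_{i=1}^{n} (y_i - \bar{y})^2 = 1 + 4 + 4 + 0 + 1 = 10 \]

Hence

\[ r = \frac{-7}{\sqrt{(30)(10)}} = -0.404. \]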
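
The hand computation for Problem 3 can be checked with a short Python sketch that applies the deviation form of the definition of r to the five data points of the problem. The function name sample_correlation is a hypothetical helper chosen for this illustration.

```python
import math


def sample_correlation(xs, ys):
    """Sample correlation coefficient r from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Data of Problem 3.
data = [(8, 3), (1, 4), (5, 0), (4, 2), (7, 1)]
xs = [x for x, _ in data]
ys = [y for _, y in data]
r = sample_correlation(xs, ys)
print(round(r, 3))  # -0.404
```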
Problem 4.
The following are the measurements of the air velocity and evaporation coefficient of
burning fuel droplets in an impulse engine:
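Solution.

\[ S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\big( \sum_{i=1}^{n} x_i \big)^2}{n} = 532000 - \frac{(2000)^2}{10} = 132000 \]

\[ S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\big( \sum_{i=1}^{n} y_i \big)^2}{n} = 9.1097 - \frac{(8.35)^2}{10} = 2.13745 \]

\[ S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\big( \sum_{i=1}^{n} x_i \big) \big( \sum_{i=1}^{n} y_i \big)}{n} = 2175.4 - \frac{(2000)(8.35)}{10} = 505.4 \]

Hence

\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{505.4}{\sqrt{(132000)(2.13745)}} = 0.9515. \]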
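
When only the summary sums are available, as in this solution, r can be computed from the shortcut formulas for S_xx, S_yy and S_xy. The Python sketch below uses the sums quoted in the solution (n = 10, Σx = 2000, Σx² = 532000, Σy = 8.35, Σy² = 9.1097, Σxy = 2175.4); the function name correlation_from_sums is a hypothetical helper, not part of the original notes.

```python
import math


def correlation_from_sums(n, sum_x, sum_y, sum_xx, sum_yy, sum_xy):
    """Return (S_xx, S_yy, S_xy, r) computed from the raw sums."""
    s_xx = sum_xx - sum_x ** 2 / n
    s_yy = sum_yy - sum_y ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    r = s_xy / math.sqrt(s_xx * s_yy)
    return s_xx, s_yy, s_xy, r


# Summary sums quoted in the solution of Problem 4.
s_xx, s_yy, s_xy, r = correlation_from_sums(
    n=10, sum_x=2000, sum_y=8.35, sum_xx=532000, sum_yy=9.1097, sum_xy=2175.4
)
print(round(r, 4))  # 0.9515
```

The shortcut form avoids computing each deviation from the mean, which is convenient by hand, though in floating-point arithmetic the deviation form is usually less prone to cancellation error.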