
Simple Linear Regression (2 hours)

8.0 Basic Concept


8.1 Graphical Method
8.2 Correlation Coefficient
8.3 Simple Linear Regression Equation
8.3.1 Least Squares Method

8.0 BASIC CONCEPT

Linear regression is the study of the linear relationship between two variables. This is done by
fitting a linear equation to the observed data. The fitted equation is then used to predict values
of one variable from the other. In a simple linear relationship, only two variables are involved:
X is the independent variable
Y is the dependent variable

Example:
A nutritionist studying weight loss programs might want to find out if reducing carbohydrate
intake can help a person reduce weight.
   o X is the carbohydrate intake
   o Y is the weight
An entrepreneur might want to know whether increasing the cost of packaging his new
product will have an effect on the sales volume.
   o X is the cost
   o Y is the sales volume

8.1 GRAPHICAL METHOD

A scatter plot is essentially a plot of the paired \((x, y)\) values. The purpose of constructing
the plot is to examine the relationship between the two variables.

Scatter plots (a) and (b) below show a positive and a negative linear relationship
respectively, whilst (c) shows no relationship between the variables \(x\) and \(y\).

1 Nurhana.PPD (2_14/15)
[Figure: three scatter plots. (a) Positive linear relationship; (b) Negative linear relationship; (c) No relationship between \(x\) and \(y\)]

Example A
In BMX dirt-bike racing, jumping high or "getting air" depends on many factors: the rider's
skill, the angle of the jump, and the weight of the bike. Here are data on the maximum height
for various bike weights.

Type              A      B      C      D      E      F      G      H      I      J
Weight (pounds)   19     19.5   20     20.5   21     22     22.5   23     23.5   24
Height (inches)   10.35  10.3   10.25  10.2   10.1   9.85   9.8    9.79   9.7    9.6

Construct a scatter diagram for these data. Does the scatter diagram exhibit a linear relationship
between weight and height of the BMX dirt-bike?

Solution:

Yes, the scatter diagram exhibits a negative linear relationship between weight and height.
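
The scatter diagram itself did not survive in these notes. As a rough stand-in, the sketch below draws a character-grid scatter plot of the Example A data using only the Python standard library. The helper name `text_scatter` and the grid size are illustrative assumptions, not part of the original notes.

```python
# Data from Example A: bike weight (pounds) and maximum jump height (inches).
weights = [19, 19.5, 20, 20.5, 21, 22, 22.5, 23, 23.5, 24]
heights = [10.35, 10.3, 10.25, 10.2, 10.1, 9.85, 9.8, 9.79, 9.7, 9.6]


def text_scatter(xs, ys, width=40, height=12):
    """Return a rough character-grid scatter plot as a list of strings.

    Hypothetical helper for illustration; assumes xs and ys are not constant.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        # Row 0 is the top of the plot, so larger y values get smaller row indices.
        row = round((y_max - y) / (y_max - y_min) * (height - 1))
        grid[row][col] = "*"
    return ["".join(cells) for cells in grid]


plot_rows = text_scatter(weights, heights)
for plot_row in plot_rows:
    print("|" + plot_row)
print("+" + "-" * 40 + "  weight (pounds)")
```

The points run from the top-left corner to the bottom-right corner of the grid, which is the pattern described as a negative linear relationship.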

Example B
The dosage chart below was prepared by a drug company for doctors who prescribed
Tobramycin, a drug that combats serious bacterial infections such as those in the central nervous
system, for life-threatening situations.

Weight (pounds), x          88   99   110  121  132  143  154  165  176  187  198
Usual dosage (mg), y_u      40   45   50   55   60   65   70   75   80   85   90
Maximum dosage (mg), y_m    66   75   83   91   100  108  116  125  133  141  150

i) Construct a scatter diagram for the data (weight, usual dosage).
ii) Construct a scatter diagram for the data (weight, maximum dosage) on the same axes.

Does the scatter diagram exhibit a linear relationship between weight and usual dosage, and also
between weight and maximum dosage?

Solution:

Yes, the scatter diagram exhibits a positive linear relationship between weight and
usual dosage, and also between weight and maximum dosage.

8.2 CORRELATION COEFFICIENT
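Correlation measures the strength of a linear relationship between two variables. One
numerical measure is the Pearson product moment correlation coefficient, \(r\):

\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \]

where

\[ S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \]

\[ S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} \]

\[ S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} \]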

Properties of \(r\):
1. \(-1 \le r \le 1\)

2. (a) A value of \(r\) equal to 1 implies there is a perfect positive linear relationship between \(x\) and \(y\).
(b) A value of \(r\) equal to \(-1\) implies there is a perfect negative linear relationship between \(x\) and \(y\).

3. (a) A value of \(r\) close to 1 implies there is a strong positive linear relationship between \(x\) and \(y\).
(b) A value of \(r\) close to \(-1\) implies there is a strong negative linear relationship between \(x\) and \(y\).

4. (a) A value of \(r\) close to 0.5 implies there is a weak positive linear relationship between \(x\) and \(y\).
(b) A value of \(r\) close to \(-0.5\) implies there is a weak negative linear relationship between \(x\) and \(y\).

5. A value of \(r\) equal to 0 implies there is no linear relationship between \(x\) and \(y\).

To find and interpret \(r\), we build a table with five column headings, \(x\), \(y\), \(xy\), \(x^2\)
and \(y^2\), and add a last row containing the sum of each column.

x          y          xy          x²          y²

Σx =       Σy =       Σxy =       Σx² =       Σy² =
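
The table-building step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `correlation_table` and `pearson_r` are made up for this sketch, and the data are the bike weights and heights from Example A, for which the notes themselves do not report a value of \(r\).

```python
import math


def correlation_table(xs, ys):
    """Build the rows (x, y, xy, x^2, y^2) and the row of column totals."""
    rows = [(x, y, x * y, x * x, y * y) for x, y in zip(xs, ys)]
    totals = tuple(sum(column) for column in zip(*rows))
    return rows, totals


def pearson_r(xs, ys):
    """Compute r = S_xy / sqrt(S_xx * S_yy) from the column totals."""
    n = len(xs)
    _, (sum_x, sum_y, sum_xy, sum_x2, sum_y2) = correlation_table(xs, ys)
    s_xx = sum_x2 - sum_x ** 2 / n
    s_yy = sum_y2 - sum_y ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    return s_xy / math.sqrt(s_xx * s_yy)


# Example A data: weight (pounds) and maximum height (inches).
weights = [19, 19.5, 20, 20.5, 21, 22, 22.5, 23, 23.5, 24]
heights = [10.35, 10.3, 10.25, 10.2, 10.1, 9.85, 9.8, 9.79, 9.7, 9.6]

rows, totals = correlation_table(weights, heights)
r = pearson_r(weights, heights)
print("column totals:", [round(t, 4) for t in totals])
print("r =", round(r, 2))
```

For these data the result is close to \(-1\), which agrees with the negative linear pattern seen in the Example A scatter diagram.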

Let's take a look at the example below:

Example C
The data in the table below are the circumferences (in feet) and heights (in feet) of trees in a
certain forest reserve.

Circumference, x   1.8   1.9   1.8   2.4   5.1   3.1   5.5   5.1   8.3   13.7
Height, y          21    33.5  24.6  40.7  73.2  24.9  40.4  45.3  53.5  93.9

Calculate the value of the correlation coefficient, \(r\), and interpret the result.

Solution:
Circumference, x  Height, y    xy              x²             y²
1.8               21           37.8            3.24           441
1.9               33.5         63.65           3.61           1122.25
1.8               24.6         44.28           3.24           605.16
2.4               40.7         97.68           5.76           1656.49
5.1               73.2         373.32          26.01          5358.24
3.1               24.9         77.19           9.61           620.01
5.5               40.4         222.2           30.25          1632.16
5.1               45.3         231.03          26.01          2052.09
8.3               53.5         444.05          68.89          2862.25
13.7              93.9         1286.43         187.69         8817.21
Σx = 48.7         Σy = 451     Σxy = 2877.63   Σx² = 364.31   Σy² = 25166.86

\[ S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = 364.31 - \frac{(48.7)^2}{10} = 364.31 - 237.169 = 127.141 \]

\[ S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = 25166.86 - \frac{(451)^2}{10} = 25166.86 - 20340.1 = 4826.76 \]

\[ S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} = 2877.63 - \frac{48.7(451)}{10} = 2877.63 - 2196.37 = 681.26 \]

\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{681.26}{\sqrt{(127.141)(4826.76)}} = \frac{681.26}{783.38} = 0.87 \]

Strong positive linear relationship.

Example D
A researcher wishes to see if there is a relationship between the ages and net worth of the
wealthiest people in Malaysia. The data for a specific year are shown.

Age, x                        73    65    53    54    79    69    61    65
Net wealth, y ($ billions)    16    26    50    21.5  40    16    19.6  19

Calculate the value of the correlation coefficient, \(r\), and interpret the result.

Solution:
Age, x     Net wealth, y ($ billions)   xy              x²            y²
73         16                           1168            5329          256
65         26                           1690            4225          676
53         50                           2650            2809          2500
54         21.5                         1161            2916          462.25
79         40                           3160            6241          1600
69         16                           1104            4761          256
61         19.6                         1195.6          3721          384.16
65         19                           1235            4225          361
Σx = 519   Σy = 208.1                   Σxy = 13363.6   Σx² = 34227   Σy² = 6495.41

\[ S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = 34227 - \frac{(519)^2}{8} = 34227 - 33670.13 = 556.87 \]

\[ S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = 6495.41 - \frac{(208.1)^2}{8} = 6495.41 - 5413.2 = 1082.21 \]

\[ S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} = 13363.6 - \frac{519(208.1)}{8} = 13363.6 - 13500.49 = -136.89 \]

\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{-136.89}{\sqrt{(556.87)(1082.21)}} = \frac{-136.89}{776.31} = -0.18 \]

Weak negative linear relationship.

Example E
The data below were obtained in a study on the number of absences and the final grades of seven
randomly selected students from a statistics class.

Student                  A   B   C   D   E   F   G
Number of absences, x    6   2   15  9   12  5   8
Final grade, y (%)       82  86  43  74  58  90  78

Calculate the value of the correlation coefficient, \(r\), and interpret the result.

Solution:
x          y          xy           x²          y²
6          82         492          36          6724
2          86         172          4           7396
15         43         645          225         1849
9          74         666          81          5476
12         58         696          144         3364
5          90         450          25          8100
8          78         624          64          6084
Σx = 57    Σy = 511   Σxy = 3745   Σx² = 579   Σy² = 38993

\[ S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = 579 - \frac{(57)^2}{7} = 579 - 464.14 = 114.86 \]

\[ S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = 38993 - \frac{(511)^2}{7} = 38993 - 37303 = 1690 \]

\[ S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} = 3745 - \frac{57(511)}{7} = 3745 - 4161 = -416 \]

\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{-416}{\sqrt{(114.86)(1690)}} = \frac{-416}{440.58} = -0.94 \]

Strong negative linear relationship.

Example F
Calculate the value of the correlation coefficient, \(r\), for the following set of data and
interpret the result.

x_i    1   1   2   4   7   8   8
y_i    4   5   8   15  23  20  25

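Solution:
x          y          xy          x²          y²
1          4          4           1           16
1          5          5           1           25
2          8          16          4           64
4          15         60          16          225
7          23         161         49          529
8          20         160         64          400
8          25         200         64          625
Σx = 31    Σy = 100   Σxy = 606   Σx² = 199   Σy² = 1884

\[ S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = 199 - \frac{(31)^2}{7} = 199 - 137.29 = 61.71 \]

\[ S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = 1884 - \frac{(100)^2}{7} = 1884 - 1428.57 = 455.43 \]

\[ S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} = 606 - \frac{31(100)}{7} = 606 - 442.86 = 163.14 \]

\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{163.14}{\sqrt{(61.71)(455.43)}} = \frac{163.14}{167.64} = 0.97 \]

Strong positive linear relationship.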
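
The hand calculations in Examples C to F can be cross-checked directly from the column totals in their tables. The sketch below is one possible way to do this; the function name `r_from_totals` is an assumption made for illustration, and the inputs are the totals copied from the four solution tables.

```python
import math


def r_from_totals(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Correlation coefficient from the totals row of the five-column table."""
    s_xx = sum_x2 - sum_x ** 2 / n
    s_yy = sum_y2 - sum_y ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    return s_xy / math.sqrt(s_xx * s_yy)


# (n, sum of x, sum of y, sum of xy, sum of x^2, sum of y^2) for each example.
totals_by_example = {
    "C": (10, 48.7, 451, 2877.63, 364.31, 25166.86),
    "D": (8, 519, 208.1, 13363.6, 34227, 6495.41),
    "E": (7, 57, 511, 3745, 579, 38993),
    "F": (7, 31, 100, 606, 199, 1884),
}

results = {name: r_from_totals(*totals) for name, totals in totals_by_example.items()}
for name, r in results.items():
    print(f"Example {name}: r = {r:.2f}")
```

Because no intermediate value is rounded here, the last digit can differ slightly from a hand calculation that rounds \(S_{xx}\), \(S_{yy}\) and \(S_{xy}\) first.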

8.3 SIMPLE LINEAR REGRESSION EQUATION
A linear regression equation is a mathematical equation that can be used to predict the
values of one dependent variable from known values of an independent variable. This equation
represents a straight line, so it has the form \(y = \beta_0 + \beta_1 x\), where \(\beta_1\) is the slope and \(\beta_0\) is the
\(y\)-intercept.
The basic simple linear regression model is given by

\[ y = \beta_0 + \beta_1 x + \varepsilon \]
where
\(x\) = independent variable
\(y\) = dependent variable
\(\beta_0\) = \(y\)-intercept of the line
\(\beta_1\) = slope of the line
\(\varepsilon\) = random error component

This regression line is estimated from the collected data by fitting a straight line to the
data set and obtaining the equation of that straight line:

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \]

Model assumptions about the random error:

\(\varepsilon \sim NID(0, \sigma^2)\), that is
1. The expected value of \(\varepsilon\) is zero, \(E(\varepsilon) = 0\).
2. The variance of \(\varepsilon\) is constant for all \(x\).
3. \(\varepsilon\) is normally distributed, that is, the probability distribution of \(\varepsilon\) is normal with mean 0 and
variance \(\sigma^2\).
4. The values of \(\varepsilon\) are independent, that is, the value of \(\varepsilon\) associated with one \(y\) value has
no effect on the value taken on by \(\varepsilon\) for other \(y\) values.
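
To make the model and its error assumptions concrete, the sketch below simulates data from \(y = \beta_0 + \beta_1 x + \varepsilon\) with independent normal errors. Everything numeric here is a hypothetical choice for illustration only (\(\beta_0 = 2\), \(\beta_1 = 0.5\), \(\sigma = 1\), the \(x\) values and the seed); none of it comes from the notes.

```python
import random


def simulate(xs, beta0, beta1, sigma, seed=0):
    """Generate y = beta0 + beta1 * x + error, with independent N(0, sigma^2) errors."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    errors = [rng.gauss(0.0, sigma) for _ in xs]
    ys = [beta0 + beta1 * x + e for x, e in zip(xs, errors)]
    return ys, errors


# Hypothetical design: x cycles through 0..9, giving 10000 observations.
xs = [i % 10 for i in range(10000)]
ys, errors = simulate(xs, beta0=2.0, beta1=0.5, sigma=1.0)

mean_error = sum(errors) / len(errors)
var_error = sum((e - mean_error) ** 2 for e in errors) / (len(errors) - 1)
print(f"sample mean of errors     = {mean_error:.3f}")
print(f"sample variance of errors = {var_error:.3f}")
```

With a large sample, the sample mean of the errors should be near 0 and the sample variance near \(\sigma^2\), in line with assumptions 1 to 3.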

8.3.1 LEAST SQUARES METHOD

The least squares method is the method most commonly used for estimating the regression
coefficients, \(\hat{\beta}_0\) and \(\hat{\beta}_1\).

The straight line fitted to the data set is the estimated linear regression line
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \]

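For any data point \((x_i, y_i)\), the deviation or error is given by
\[ e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \]

To minimize this error, the sum of squared errors

\[ \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \]

is minimized by finding the partial derivatives with respect to \(\hat{\beta}_0\) and \(\hat{\beta}_1\) and setting them equal to zero. That is,

\[ \frac{\partial \sum e_i^2}{\partial \hat{\beta}_0} = -2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \]

\[ \frac{\partial \sum e_i^2}{\partial \hat{\beta}_1} = -2 \sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \]

By rearranging terms, the following normal equations are obtained:

\[ n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i \]

\[ \hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i \]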

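The equations can be solved simultaneously, giving the following estimators:

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \frac{\sum_{i=1}^{n} y_i}{n} - \hat{\beta}_1 \frac{\sum_{i=1}^{n} x_i}{n} \]

\[ \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} \]

where \(\bar{x}\) and \(\bar{y}\) are the means of \(x\) and \(y\) respectively,
\(\hat{\beta}_0\) is the intercept on the \(y\)-axis, and
\(\hat{\beta}_1\) is the slope of the regression equation.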
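
The algebra behind the simultaneous solution is not written out in the notes. One way to fill in the step, using only the normal equations and the definitions of \(S_{xx}\) and \(S_{xy}\) from section 8.2, is to divide the first normal equation by \(n\) to get the intercept, then substitute it into the second:

\[
\begin{aligned}
n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i
  &\;\Rightarrow\; \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \\
(\bar{y} - \hat{\beta}_1 \bar{x}) \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i
  &\;\Rightarrow\; \hat{\beta}_1 \left( \sum_{i=1}^{n} x_i^2 - \bar{x} \sum_{i=1}^{n} x_i \right) = \sum_{i=1}^{n} x_i y_i - \bar{y} \sum_{i=1}^{n} x_i \\
  &\;\Rightarrow\; \hat{\beta}_1 \left( \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \right) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} \\
  &\;\Rightarrow\; \hat{\beta}_1 S_{xx} = S_{xy}
  \;\Rightarrow\; \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}
\end{aligned}
\]

The third line uses \(\bar{x} = \sum x_i / n\) and \(\bar{y} = \sum y_i / n\).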

Given any value of \(x\), the predicted value of the dependent variable, \(\hat{y}\), can be found by
substituting \(x\) into the fitted simple linear regression equation.
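
As a minimal sketch of the least squares fit and the prediction step, the code below applies the two estimator formulas to raw data and then substitutes a value of \(x\). The function names `fit_line` and `predict` are assumptions made for illustration; the data are the absences and final grades from Example E.

```python
def fit_line(xs, ys):
    """Least squares estimates: slope = S_xy / S_xx, intercept = y_bar - slope * x_bar."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    slope = s_xy / s_xx
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope


def predict(intercept, slope, x):
    """Substitute x into the fitted line y_hat = intercept + slope * x."""
    return intercept + slope * x


# Example E data: number of absences and final grade (%).
absences = [6, 2, 15, 9, 12, 5, 8]
grades = [82, 86, 43, 74, 58, 90, 78]

b0, b1 = fit_line(absences, grades)
residuals = [y - predict(b0, b1, x) for x, y in zip(absences, grades)]

print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
print(f"predicted grade for 2 absences = {predict(b0, b1, 2):.1f}")
# The two normal equations say these two sums are zero (up to rounding error).
print(f"sum of residuals = {sum(residuals):.2e}")
print(f"sum of x * residual = {sum(x * e for x, e in zip(absences, residuals)):.2e}")
```

The last two printed sums are a quick check that the fitted line satisfies the normal equations.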

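The formulas used in linear regression are summarized below.

Correlation coefficient

\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \]

where

\[ S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \]

\[ S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} \]

\[ S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} \]

Basic simple linear regression model

\[ y = \beta_0 + \beta_1 x + \varepsilon \]

where
\(x\) = independent variable
\(y\) = dependent variable
\(\beta_0\) = \(y\)-intercept of the line
\(\beta_1\) = slope of the line
\(\varepsilon\) = random error component

Fitted simple linear regression

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \]

where

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \frac{\sum_{i=1}^{n} y_i}{n} - \hat{\beta}_1 \frac{\sum_{i=1}^{n} x_i}{n} \]

\[ \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} \]

\(\bar{x}\) and \(\bar{y}\) = the means of \(x\) and \(y\) respectively
\(\hat{\beta}_0\) = \(y\)-intercept of the line
\(\hat{\beta}_1\) = slope of the regression equation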

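Example G (From Example C)

\(S_{xx} = 127.141\), \(S_{xy} = 681.26\), \(S_{yy} = 4826.76\), \(\sum_{i=1}^{n} y_i = 451\), \(\sum_{i=1}^{n} x_i = 48.7\), \(n = 10\)

(a) Find the estimates of the slope, \(\hat{\beta}_1\), and the intercept, \(\hat{\beta}_0\).
(b) Find the regression line equation.

Solution:
(a)
\[ \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{681.26}{127.141} = 5.36 \]

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \frac{\sum_{i=1}^{n} y_i}{n} - \hat{\beta}_1 \frac{\sum_{i=1}^{n} x_i}{n} = \frac{451}{10} - (5.36)\frac{48.7}{10} = 45.1 - 26.1 = 19 \]

(b)
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 19 + 5.36x \]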

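Example H (From Example D)

\(S_{xx} = 556.87\), \(S_{xy} = -136.89\), \(S_{yy} = 1082.21\), \(\sum_{i=1}^{n} y_i = 208.1\), \(\sum_{i=1}^{n} x_i = 519\), \(n = 8\)

(a) Find the estimates of the slope, \(\hat{\beta}_1\), and the intercept, \(\hat{\beta}_0\).
(b) Find the regression line equation.
(c) Predict the net wealth of a 35-year-old person using this model.

Solution:
(a)
\[ \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{-136.89}{556.87} = -0.25 \]

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \frac{\sum_{i=1}^{n} y_i}{n} - \hat{\beta}_1 \frac{\sum_{i=1}^{n} x_i}{n} = \frac{208.1}{8} - (-0.25)\frac{519}{8} = 26.01 + 16.22 = 42.23 \]

The estimate of the slope is \(\hat{\beta}_1 = -0.25\) and the estimate of the intercept is \(\hat{\beta}_0 = 42.23\).

(b)
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 42.23 - 0.25x \]

(c) When \(x = 35\),
\[ \hat{y} = 42.23 - 0.25(35) = 33.48 \]
that is, a predicted net wealth of $33.48 billion.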
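
Example H rounds the slope to \(-0.25\) before computing the intercept and the prediction. As a rough check that is not part of the original notes, the sketch below recomputes the same quantities from the same summary values without intermediate rounding. The helper name `fit_from_summary` is hypothetical.

```python
def fit_from_summary(n, sum_x, sum_y, s_xx, s_xy):
    """Slope and intercept from summary statistics, with no intermediate rounding."""
    slope = s_xy / s_xx
    intercept = sum_y / n - slope * (sum_x / n)
    return intercept, slope


# Summary values quoted in Example H (taken from Example D).
b0, b1 = fit_from_summary(n=8, sum_x=519, sum_y=208.1, s_xx=556.87, s_xy=-136.89)
y_hat_35 = b0 + b1 * 35

print(f"slope = {b1:.4f}")
print(f"intercept = {b0:.2f}")
print(f"prediction at x = 35: {y_hat_35:.2f}")
```

The unrounded slope is about \(-0.246\), so the intercept and the prediction come out a little lower than the hand-calculated 42.23 and 33.48. The difference is only the effect of rounding the slope early, not a different method.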

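Example I (From Example E)

\(S_{xx} = 114.86\), \(S_{xy} = -416\), \(S_{yy} = 1690\), \(\sum_{i=1}^{n} y_i = 511\), \(\sum_{i=1}^{n} x_i = 57\), \(n = 7\)

(a) Find the estimates of the slope, \(\hat{\beta}_1\), and the intercept, \(\hat{\beta}_0\).
(b) Find the regression line equation.
(c) Predict the final grade of a student who is absent twice, using this model.

Solution:
(a)
\[ \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{-416}{114.86} = -3.62 \]

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \frac{\sum_{i=1}^{n} y_i}{n} - \hat{\beta}_1 \frac{\sum_{i=1}^{n} x_i}{n} = \frac{511}{7} - (-3.62)\frac{57}{7} = 73 + 29.48 = 102.48 \]

(b)
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 102.48 - 3.62x \]

(c) When \(x = 2\),
\[ \hat{y} = 102.48 - 3.62(2) = 95.24 \]
that is, a predicted final grade of about 95.2%.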

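Example J (From Example F)

\(S_{xx} = 61.71\), \(S_{xy} = 163.14\), \(S_{yy} = 455.43\), \(\sum_{i=1}^{n} y_i = 100\), \(\sum_{i=1}^{n} x_i = 31\), \(n = 7\)

(a) Find the estimates of the slope, \(\hat{\beta}_1\), and the intercept, \(\hat{\beta}_0\).
(b) Find the regression line equation.
(c) Predict \(\hat{y}\) if \(x = 3\).

Solution:
(a)
\[ \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{163.14}{61.71} = 2.64 \]

\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \frac{\sum_{i=1}^{n} y_i}{n} - \hat{\beta}_1 \frac{\sum_{i=1}^{n} x_i}{n} = \frac{100}{7} - (2.64)\frac{31}{7} = 14.29 - 11.69 = 2.6 \]

(b)
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = 2.6 + 2.64x \]

(c) When \(x = 3\),
\[ \hat{y} = 2.6 + 2.64(3) = 10.52 \]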

Example K [FULL EXERCISE]
A study is conducted to determine the relationship between a driver's age and the number of
accidents he/she was involved in over a year.

Driver's age, x           16  24  18  17  23  27  32
Number of accidents, y    3   2   5   2   0   1   1

(a) Construct a scatter diagram for these data. Does the scatter diagram exhibit a linear
relationship between driver's age and number of accidents?
(b) Calculate the value of the correlation coefficient, \(r\), and interpret the result.
(c) Find the estimates of the slope, \(\hat{\beta}_1\), and the intercept, \(\hat{\beta}_0\).
(d) Find the regression line equation.
(e) Predict the number of accidents for a driver who is 20 years old.

Example L [FULL EXERCISE]

The data below represent scores obtained by 10 primary school students before and after they
were taken on a tour to the museum (which is supposed to increase their interest in history).

Before, x    65  63  76  46  68  72  68  57  36  96
After, y     68  66  86  48  65  66  71  57  42  87

(a) Construct a scatter diagram for these data. Does the scatter diagram exhibit a linear
relationship between scores before and after their tour to the museum?
(b) Calculate the value of the correlation coefficient, \(r\), and interpret the result.
(c) Find the estimates of the slope, \(\hat{\beta}_1\), and the intercept, \(\hat{\beta}_0\).
(d) Find the regression line equation.
(e) Predict the score after the tour for a student who scored 60 marks before the tour.

Answer:
(a) Yes. The scatter diagram exhibits a linear relationship.
(b) \(r = 0.94\) (strong positive linear relationship)
(c) \(\hat{\beta}_1 = 0.82\), \(\hat{\beta}_0 = 12.31\)
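
The answer key stops at part (c). As a sketch only, the code below works parts (b) to (e) of Example L from the raw scores, so that the listed answers can be checked and the remaining parts attempted. The variable names are illustrative, and the prediction for part (e) is computed here rather than quoted from the notes.

```python
import math

# Example L data: scores before and after the museum tour.
before = [65, 63, 76, 46, 68, 72, 68, 57, 36, 96]
after = [68, 66, 86, 48, 65, 66, 71, 57, 42, 87]

n = len(before)
sum_x, sum_y = sum(before), sum(after)
s_xx = sum(x * x for x in before) - sum_x ** 2 / n
s_yy = sum(y * y for y in after) - sum_y ** 2 / n
s_xy = sum(x * y for x, y in zip(before, after)) - sum_x * sum_y / n

r = s_xy / math.sqrt(s_xx * s_yy)          # part (b)
slope = s_xy / s_xx                        # part (c)
intercept = sum_y / n - slope * sum_x / n  # part (c)
predicted_after_60 = intercept + slope * 60  # part (e)

print(f"(b) r = {r:.2f}")
print(f"(c) slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"(d) y_hat = {intercept:.2f} + {slope:.2f} x")
print(f"(e) predicted score after the tour for x = 60: {predicted_after_60:.2f}")
```

Parts (b) and (c) reproduce the listed answers to two decimal places. The value for part (e) depends slightly on whether the rounded or unrounded coefficients are used.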
