Regression And Correlation : Lecture Notes


Regression And Correlation : Bioepi 540 UMASS-Amherst Fall 2000
Regression 1
An Introduction to REGRESSION AND CORRELATION

How do we measure the association of 2 continuous, numeric-scale variables? Let's begin with an example: Observations on both systolic blood pressure (SBP) and age are available for a sample of 30 individuals. We are interested in the relationship between SBP and age for these patients, and for the population which they represent.

Note: We have 30 pairs of observations, which we can denote as

(x₁, y₁), (x₂, y₂), ..., (x₃₀, y₃₀) = (39, 144), (47, 220), ..., (69, 175)

where xᵢ refers to age for the i-th subject, and yᵢ to SBP for the i-th subject.

These data pairs may be considered as points in two-dimensional space, so that we may plot them on a graph. Such a graph is called a scatterplot or scatter diagram.

individual (i)   SBP (Y)   AGE (x)      individual (i)   SBP (Y)   AGE (x)
      1            144        39              16            130        48
      2            220        47              17            135        45
      3            138        45              18            114        17
      4            145        47              19            116        20
      5            162        65              20            124        19
      6            142        46              21            136        36
      7            170        67              22            142        50
      8            124        42              23            120        39
      9            158        67              24            120        21
     10            154        56              25            160        44
     11            162        64              26            158        53
     12            150        56              27            144        63
     13            140        59              28            130        29
     14            110        34              29            125        25
     15            128        42              30            175        69
Regression 2
Scatter diagram of age and systolic blood pressure

[Scatterplot of the 30 data points, with AGE (20 to 80) on the horizontal axis and SBP (120 to 240) on the vertical axis.]

Note that age and SBP seem to be related. Younger subjects tend to have lower SBP, and older subjects higher SBP. How can this relationship be measured?

Scatter diagrams can take many shapes:

- No relationship between x and y: the spread of points is even in all directions.
- Linear relationship: a line indicates the main direction of the spread of points.
- Non-linear relationship between x and y: a curve best describes the relationship.
Regression 3
Math Review: Equation for a Line

y = β₀ + β₁x

β₀ = "y-intercept" = value of y when x = 0
β₁ = "slope" = Δy/Δx = (change in y)/(change in x)

e.g., depending on β₁, a line may have slope > 0, slope = 0, or slope < 0.
Regression 4
Now, given a set of data, how can we get the line that best fits, or best represents, the data?

We will use a technique known as Least Squares Regression to estimate β₀ and β₁. We will denote the estimates β̂₀ and β̂₁, respectively.

We are looking for that line ŷ = β̂₀ + β̂₁x which minimizes the vertical distances to the data points. That is, β̂₀ and β̂₁ are chosen such that the sum of the squared vertical distances,

Σᵢ₌₁ⁿ dᵢ²,

is minimized. For each observed value xᵢ, we have an observed yᵢ and the predicted value ŷᵢ on the line. The vertical distances are dᵢ = (yᵢ − ŷᵢ). Thus

Σᵢ₌₁ⁿ dᵢ² = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²,

that is, the sum of squared deviations from the line (sound familiar?).

That is, we want the line such that

Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = Σᵢ₌₁ⁿ (yᵢ − β̂₀ − β̂₁xᵢ)²

is minimized.
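As a quick illustration, here is a minimal Python sketch that evaluates this sum of squared vertical distances on the 30 (age, SBP) pairs above for the least squares line (derived on the following pages) and for two perturbed candidate lines; the perturbed coefficients are illustrative choices, not values from the notes.

```python
# Sum of squared vertical distances for a candidate line b0 + b1*x,
# evaluated on the 30 (age, SBP) pairs from the table above.
age = [39, 47, 45, 47, 65, 46, 67, 42, 67, 56, 64, 56, 59, 34, 42,
       48, 45, 17, 20, 19, 36, 50, 39, 21, 44, 53, 63, 29, 25, 69]
sbp = [144, 220, 138, 145, 162, 142, 170, 124, 158, 154, 162, 150,
       140, 110, 128, 130, 135, 114, 116, 124, 136, 142, 120, 120,
       160, 158, 144, 130, 125, 175]

def sum_sq_dist(b0, b1):
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(age, sbp))

# The least squares estimates beat either perturbed line:
print(sum_sq_dist(98.71, 0.97))   # about 8394 -- the minimum
print(sum_sq_dist(98.71, 1.20))   # about 11900 -- worse
print(sum_sq_dist(110.00, 0.97))  # about 12200 -- worse
```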
Regression 5
This can be diagrammed as:

[Diagram: the fitted line ŷ = β̂₀ + β̂₁x, with an observed point (xᵢ, yᵢ) and the corresponding point on the line (xᵢ, ŷᵢ); the vertical distance between them is yᵢ − ŷᵢ.]

The minimum sum of squares corresponding to the least squares estimates β̂₀ and β̂₁ is usually called:

"The sum of squares about the regression line"
or
"The residual sum of squares"
or
"The sum of squares due to error" (SSE)

The SSE is important in assessing the quality of the straight-line fit. This will be discussed in more detail later.
Regression 6
The solution to the best-fit problem is obtained by solving, simultaneously, the following equations:

Σyᵢ = nβ̂₀ + β̂₁Σxᵢ
Σxᵢyᵢ = β̂₀Σxᵢ + β̂₁Σxᵢ²

These come from calculus: first take the derivative with respect to β̂₀ and then with respect to β̂₁, and set each equal to zero.

Hence, as can be seen from the above equations, all we need to solve for β̂₀ and β̂₁ is:

Σxᵢ, Σyᵢ, Σxᵢyᵢ, Σxᵢ², n

Note: if β₀* and β₁* denote any other possible estimators of β₀ and β₁, such that y* = β₀* + β₁*x, then

SSE = Σᵢ₌₁ⁿ (yᵢ − β̂₀ − β̂₁xᵢ)² ≤ Σᵢ₌₁ⁿ (yᵢ − β₀* − β₁*xᵢ)²
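As a sketch, the two equations above can be solved directly as a 2×2 linear system; the sums used here are the column totals from the spreadsheet shown later in these notes.

```python
import numpy as np

# Column totals from the spreadsheet below: n, sum(x), sum(y),
# sum(x*y), sum(x^2) for the age-SBP data.
n, Sx, Sy, Sxy, Sxx = 30, 1354, 4276, 199576, 67894

# Normal equations:  [ n   Sx  ] [b0]   [ Sy  ]
#                    [ Sx  Sxx ] [b1] = [ Sxy ]
A = np.array([[n, Sx], [Sx, Sxx]], dtype=float)
rhs = np.array([Sy, Sxy], dtype=float)
b0, b1 = np.linalg.solve(A, rhs)
print(round(b0, 3), round(b1, 3))  # 98.715 0.971
```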
Regression 7
The unbiased estimates of β₀ and β₁, which are the least squares estimates (and also the minimum variance estimates), are more easily computed by:

β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

and

β̂₀ = ȳ − β̂₁x̄

Example
Using the previous data on 30 individuals where we measured
X = AGE
Y = SBP
Regression 8
computations result in:

β̂₁ = 0.97

and

β̂₀ = ȳ − β̂₁x̄ = 142.53 − (0.97)(45.13) = 98.71

Thus, the equation for this straight line is given by

ŷ = 98.71 + 0.97x
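For instance, the fitted line can be used to predict SBP at a given age; the helper below is a minimal sketch (the name predict_sbp is just an illustrative choice) that plugs an age into the rounded equation.

```python
# Predicted SBP from the fitted line y_hat = 98.71 + 0.97 * age.
def predict_sbp(age_years):
    return 98.71 + 0.97 * age_years

print(predict_sbp(45.13))  # about 142.5, near the mean SBP at the mean age
```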
Regression 9
This line can now be plotted on the scatter diagram.

[Scatterplot of SBP (120 to 240) against AGE (20 to 80), showing two fitted lines:
ŷ = 98.71 + 0.97x, the least squares line, and
ŷ = 97.08 + 0.95x, the least squares line eliminating the outlier (note: n = 29).]

Now, recall that

SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

where ŷᵢ = β̂₀ + β̂₁xᵢ.

Clearly, if SSE = 0 we have a perfect fit, i.e., yᵢ = ŷᵢ for all i; as the fit gets worse, SSE gets larger.
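A quick sketch of the SSE computation, using the algebraically equivalent shortcut SSE = Σ(yᵢ − ȳ)² − β̂₁Σ(xᵢ − x̄)(yᵢ − ȳ) with the column sums tabulated in the spreadsheet below:

```python
# SSE via centered sums; totals are from the spreadsheet below.
n, Sx, Sy, Sxx, Syy, Sxy = 30, 1354, 4276, 67894, 624260, 199576

sxy = Sxy - Sx * Sy / n   # sum (x_i - xbar)(y_i - ybar)
sxx = Sxx - Sx ** 2 / n   # sum (x_i - xbar)^2
syy = Syy - Sy ** 2 / n   # sum (y_i - ybar)^2

b1 = sxy / sxx
sse = syy - b1 * sxy
print(round(sse, 1))  # about 8393.4 -- the RESIDUAL sum of squares below
```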
Regression 10
TOTAL OBSERVATIONS: 30
SBP AGE
N OF CASES 30 30
MINIMUM 110.000 17.000
MAXIMUM 220.000 69.000
RANGE 110.000 52.000
MEAN 142.533 45.133
VARIANCE 509.913 233.913
STANDARD DEV 22.581 15.294
STD. ERROR 4.123 2.792
SKEWNESS(G1) 1.292 -0.240
KURTOSIS(G2) 2.684 -0.833
SUM 4276.000 1354.000
C.V. 0.158 0.339
MEDIAN 141.000 45.500
DEP VAR: SBP N: 30 MULTIPLE R: 0.658 SQUARED MULTIPLE R: 0.432
ADJUSTED SQUARED MULTIPLE R: 0.412 STANDARD ERROR OF ESTIMATE: 17.314
VARIABLE COEFFICIENT STD ERROR STD COEF TOLERANCE T P(2 TAIL)
CONSTANT 98.715 10.000 0.000 . 9.871 0.000
AGE 0.971 0.210 0.658 1.000 4.618 0.000
ANALYSIS OF VARIANCE
SOURCE SUM-OF-SQUARES DF MEAN-SQUARE F-RATIO P
REGRESSION 6394.023 1 6394.023 21.330 0.000
RESIDUAL 8393.444 28 299.766
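(A check worth noting: SQUARED MULTIPLE R = 0.432 in this output is the regression sum of squares as a fraction of the total, 6394.023/(6394.023 + 8393.444) ≈ 0.432, and MULTIPLE R = √0.432 ≈ 0.658, which matches the correlation coefficient computed later.)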
Regression 11
A    B       C       D      E      F
ID   SBP(y)  AGE(x)  y^2    x^2    xy
1 144 39 20736 1521 5616
2 220 47 48400 2209 10340
3 138 45 19044 2025 6210
4 145 47 21025 2209 6815
5 162 65 26244 4225 10530
6 142 46 20164 2116 6532
7 170 67 28900 4489 11390
8 124 42 15376 1764 5208
9 158 67 24964 4489 10586
10 154 56 23716 3136 8624
11 162 64 26244 4096 10368
12 150 56 22500 3136 8400
13 140 59 19600 3481 8260
14 110 34 12100 1156 3740
15 128 42 16384 1764 5376
16 130 48 16900 2304 6240
17 135 45 18225 2025 6075
18 114 17 12996 289 1938
19 116 20 13456 400 2320
20 124 19 15376 361 2356
21 136 36 18496 1296 4896
22 142 50 20164 2500 7100
23 120 39 14400 1521 4680
24 120 21 14400 441 2520
25 160 44 25600 1936 7040
26 158 53 24964 2809 8374
27 144 63 20736 3969 9072
28 130 29 16900 841 3770
29 125 25 15625 625 3125
30 175 69 30625 4761 12075
Sum 4276 1354 624260 67894 199576
Mean 142.53 45.13
Variance 509.91 233.91
Std Dev 22.58 15.29
n= 30
slope= 0.971
intercept= 98.715
corr= 0.658
Regression 12
The same spreadsheet, displayed with its cell formulas rather than values. Columns: A = ID, B = SBP(y), C = AGE(x), D = y^2, E = x^2, F = xy; the data occupy rows 2 through 31, and (as the cell references in the formulas show) the sums sit in row 33, the means in row 34, the standard deviations in row 36, n in E37, and the slope in E38.

ID column (A):   1, then =A2+1, =A3+1, ..., down the column
y^2 column (D):  =B2*B2, =B3*B3, ..., =B31*B31
x^2 column (E):  =C2*C2, =C3*C3, ..., =C31*C31
xy column (F):   =B2*C2, =B3*C3, ..., =B31*C31

Sum:       =SUM(B2:B32)  =SUM(C2:C32)  =SUM(D2:D32)  =SUM(E2:E32)  =SUM(F2:F32)
Mean:      =AVERAGE(B2:B31)  =AVERAGE(C2:C31)
Variance:  =VAR(B2:B31)  =VAR(C2:C31)
Std Dev:   =STDEV(B2:B31)  =STDEV(C2:C31)
n=         =COUNT(E2:E31)
slope=     =(F33-((C33*B33)/E37))/(E33-(C33^2)/E37)
intercept= =B34-E38*C34
corr=      =E38*C36/B36
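The same arithmetic, written as a short Python sketch that mirrors the spreadsheet's formulas:

```python
import math

# Column totals from the spreadsheet: sums of x, y, x^2, y^2, x*y.
n = 30
sum_x, sum_y = 1354, 4276
sum_x2, sum_y2 = 67894, 624260
sum_xy = 199576

# slope: same algebra as =(F33-((C33*B33)/E37))/(E33-(C33^2)/E37)
slope = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)

# intercept: =B34-E38*C34, i.e. ybar - slope * xbar
intercept = sum_y / n - slope * (sum_x / n)

# corr: =E38*C36/B36, i.e. slope * (std dev of x) / (std dev of y)
sd_x = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))
sd_y = math.sqrt((sum_y2 - sum_y ** 2 / n) / (n - 1))
corr = slope * sd_x / sd_y

print(round(slope, 3), round(intercept, 3), round(corr, 3))
# 0.971 98.715 0.658
```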
Regression 13
Now, one of the assumptions for regression analysis is that of homoscedasticity (the variance of y is the same for any x).

[Figure: the line μ_y|x = β₀ + β₁x, with the distribution of y drawn at four values x₁, x₂, x₃, x₄; each distribution is centered at its mean μ_y|x₁, μ_y|x₂, μ_y|x₃, μ_y|x₄ on the line.]

Here,

σ²_y|x₁ = σ²_y|x₂ = σ²_y|x₃ = σ²_y|x₄

i.e., σ²_y|xᵢ is the same for all i. We will denote this common value σ², i.e.,

σ²_y|x = σ² for all x.

For each value of x, the values of y are normally distributed around μ_y|x, on the line, with the same variance σ² for all values of x, but different means μ_y|x.
Regression 14
In regression analysis, X is used to predict Y; in our example, age to predict SBP.

An estimate of σ² is given by the formula

s²_y|x = (1/(n−2)) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = (1/(n−2)) (SSE) = ((n−1)/(n−2)) (s²_y − β̂₁² s²_x)

We lose 2 d.f.: one for β̂₀ and one for β̂₁. Here s²_x is the sample variance of x, and s²_y is the sample variance of y.
Regression 15
s²_x = Σ(xᵢ − x̄)² / (n−1) = [Σxᵢ² − (Σxᵢ)²/n] / (n−1)

and

s²_y = Σ(yᵢ − ȳ)² / (n−1) = [Σyᵢ² − (Σyᵢ)²/n] / (n−1)

(We lose 1 d.f. for estimating the mean.)

In our example:

s²_Y = 509.91
s²_X = 233.91
β̂₁ = 0.97

so

s²_Y|X = ((n−1)/(n−2)) (s²_Y − β̂₁² s²_X) = (29/28) (509.91 − (0.97)²(233.91)) = 299.77

and

S_Y|X = √299.77 = 17.31

where S_Y|X is called the "standard error of estimate".
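The same computation as a brief Python sketch, using the variance identity just stated:

```python
import math

# s2_yx = ((n-1)/(n-2)) * (s2_y - b1^2 * s2_x), values from the notes.
n = 30
s2_y, s2_x = 509.913, 233.913  # sample variances of SBP and AGE
b1 = 0.971                     # least squares slope

s2_yx = (n - 1) / (n - 2) * (s2_y - b1 ** 2 * s2_x)
print(round(s2_yx, 1), round(math.sqrt(s2_yx), 2))
# about 299.7 and 17.31, the "standard error of estimate"
```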
Regression 16
Now, if we assume that for any fixed value of x, y has a normal distribution, we can test hypotheses and construct confidence intervals for β₀ or β₁.

Under this assumption, it can be shown that

β̂₀ ~ N( β₀, σ² [1/n + x̄² / ((n−1)s²_X)] )

and

β̂₁ ~ N( β₁, σ² / ((n−1)s²_X) )

Since we don't know σ², we estimate it with s²_y|x and use the t-distribution with n−2 degrees of freedom.

First consider β₁. In order to test H₀: β₁ = β₁⁽⁰⁾, where β₁⁽⁰⁾ is some hypothesized value for β₁, the test statistic is

t = (β̂₁ − β₁⁽⁰⁾) / ( s_y|x / (s_x √(n−1)) ),   t ~ t(n−2)

or, setting up confidence intervals for β₁:

β̂₁ − t₁₋α/₂(n−2) · S_Y|X / (s_x √(n−1))  ≤  β₁  ≤  β̂₁ + t₁₋α/₂(n−2) · S_Y|X / (s_x √(n−1))
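A short sketch of this confidence interval for the example data, leaning on scipy for the t quantile:

```python
import math
from scipy import stats

# 95% confidence interval for beta_1; values are from the example.
n = 30
b1 = 0.971
s_yx = 17.31   # standard error of estimate
s_x = 15.29    # sample standard deviation of x

se_b1 = s_yx / (s_x * math.sqrt(n - 1))  # about 0.210
t_crit = stats.t.ppf(0.975, df=n - 2)    # t_{.975}(28) = 2.0484
print(round(b1 - t_crit * se_b1, 2), round(b1 + t_crit * se_b1, 2))
# roughly (0.54, 1.40)
```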
Regression 17
e.g., in the current example, suppose we wish to test

H₀: β₁ = 0 vs. Hₐ: β₁ ≠ 0

Then

t = (β̂₁ − β₁⁽⁰⁾) / ( s_y|x / (s_x √(n−1)) ) = (0.97 − 0) / ( 17.31 / ((15.29)√29) ) = 4.62

and we reject H₀ if t > t.₉₇₅(28) = 2.0484 or if t < t.₀₂₅(28) = −2.0484, so we reject H₀ at α = .05 (in fact, p < .001).

This means that x provides significant information for predicting y. That is, ŷ = ȳ + β̂₁(x − x̄) is far better than the naive model ŷ = ȳ. A better model might exist (e.g., one with a curvilinear term), but there is a definite linear component: the straight-line model may very well represent only a linear approximation to a truly nonlinear relationship, yet the linear model certainly fits better than ȳ alone.
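The same test as a minimal scipy sketch, with a two-sided p-value added for illustration:

```python
import math
from scipy import stats

# Slope test from the example: t = (b1 - 0) / (s_yx / (s_x * sqrt(n-1))).
n = 30
b1, s_yx, s_x = 0.97, 17.31, 15.29

t = (b1 - 0) / (s_yx / (s_x * math.sqrt(n - 1)))
p = 2 * stats.t.sf(abs(t), df=n - 2)  # two-sided p-value
print(round(t, 2), p)
# about 4.61 with these rounded inputs (4.62 in the notes); p < .001
```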
Regression 18
note: if H₀: β₁ = 0 is not rejected, it means either

- x provides little or no help in predicting y, or
- the true relationship between x and y is not linear.

* Important point: whether or not H₀: β₁ = 0 is rejected, the straight-line model may not be appropriate. Some other function may better describe the relationship between x and y.

Now consider β₀. In order to test H₀: β₀ = β₀⁽⁰⁾, the test statistic used is

t = (β̂₀ − β₀⁽⁰⁾) / ( s_y|x √(1/n + x̄² / ((n−1)s²_x)) ),   t ~ t(n−2)
Regression 19
e.g., continuing this example, to test

H₀: β₀ = 75 vs. Hₐ: β₀ ≠ 75

t = (β̂₀ − β₀⁽⁰⁾) / ( s_y|x √(1/n + x̄² / ((n−1)s²_x)) )
  = (98.71 − 75) / ( 17.31 √(1/30 + (45.13)² / ((29)(15.29)²)) )
  = 2.37

and again we reject H₀ at the α = .05 level; here .02 < p < .05.

Confidence intervals may be constructed as

β̂₀ − t₁₋α/₂(n−2) · S_Y|X √(1/n + x̄² / ((n−1)s²_X))  ≤  β₀  ≤  β̂₀ + t₁₋α/₂(n−2) · S_Y|X √(1/n + x̄² / ((n−1)s²_X))

e.g.,

98.71 − 2.0484(17.31)√(1/30 + (45.13)²/(29(15.29)²))  ≤  β₀  ≤  98.71 + 2.0484(17.31)√(1/30 + (45.13)²/(29(15.29)²))

78.23 ≤ β₀ ≤ 119.20
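Both the test and the interval for β₀, as a small scipy sketch with the example's values:

```python
import math
from scipy import stats

# t test and 95% CI for beta_0, matching the example above.
n, xbar = 30, 45.13
b0, s_yx, s_x = 98.71, 17.31, 15.29

se_b0 = s_yx * math.sqrt(1 / n + xbar ** 2 / ((n - 1) * s_x ** 2))  # about 10.0
t = (b0 - 75) / se_b0                  # about 2.37
t_crit = stats.t.ppf(0.975, df=n - 2)  # 2.0484
print(round(t, 2))
print(round(b0 - t_crit * se_b0, 2), round(b0 + t_crit * se_b0, 2))
# the interval is roughly (78.2, 119.2)
```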
Regression 20
The Correlation Coefficient

DEF: The correlation coefficient provides a measure of how two random variables are associated in a sample. It is also a measure of the strength of the straight-line relationship between X and Y.

r = cov(x, y) / √( var(x) var(y) )

  = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / [ Σᵢ₌₁ⁿ (xᵢ − x̄)² · Σᵢ₌₁ⁿ (yᵢ − ȳ)² ]^(1/2)

  = [ Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n ] / { [ Σxᵢ² − (Σxᵢ)²/n ] [ Σyᵢ² − (Σyᵢ)²/n ] }^(1/2)

note: since

β̂₁ = cov(x, y)/var(x) = S_xy/S_x²   and   r = S_xy/(S_x S_y)

we have that

r = (S_x/S_y) β̂₁

Example: In the age-SBP example,

r = [ 199,576 − (1354)(4276)/30 ] / { [ 67,894 − (1354)²/30 ] [ 624,260 − (4276)²/30 ] }^(1/2) = 0.66

or, more simply, since β̂₁ = 0.97:

r = (15.29/22.58)(0.97) = 0.66
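The first computation, as a small Python sketch using the column sums from the spreadsheet:

```python
import math

# Correlation coefficient from the computational formula.
n = 30
Sx, Sy, Sxx, Syy, Sxy = 1354, 4276, 67894, 624260, 199576

r = (Sxy - Sx * Sy / n) / math.sqrt(
    (Sxx - Sx ** 2 / n) * (Syy - Sy ** 2 / n)
)
print(round(r, 2))  # 0.66
```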
Regression 21
Let us motivate what is meant by the covariance between x and y.

note: the population covariance is

σ_xy = (1/N) Σᵢ₌₁ᴺ (xᵢ − μ_x)(yᵢ − μ_y)

[Figure: a scatterplot divided into four quadrants, I through IV, by a vertical line at x̄ and a horizontal line at ȳ.]

Quadrant   xᵢ − x̄   yᵢ − ȳ   (xᵢ − x̄)(yᵢ − ȳ)
   I          +         +              +
   II         −         +              −
   III        −         −              +
   IV         +         −              −

Actually, r is the standardized covariance, and r is dimensionless, i.e., it is independent of the units of measurement of x or y.

Now, −1 ≤ r ≤ +1.

Finally, r always has the same sign as β̂₁.
Regression 22
Now, if points look like:

[Three scatter patterns:]

(1) S_xy > 0, since most points are in QI and QIII: r > 0, β̂₁ > 0.
(2) S_xy ≈ 0, since + points are offset by − points: r ≈ 0, β̂₁ ≈ 0.
(3) S_xy < 0, since most points are in QII and QIV: r < 0, β̂₁ < 0.

[Two further panels, (a) and (b):] r in (a) is much greater than r in (b), since there are fewer points in QII and QIV in (a). This is true even if the slopes are identical.
Regression 23
Testing Hypotheses concerning ρ_xy

To test H₀: ρ_xy = 0 vs. Hₐ: ρ_xy ≠ 0 (or one-sided), we may simply test

H₀: β₁ = 0 vs. Hₐ: β₁ ≠ 0, as we learned before,

or, we may use

t = r√(n−2) / √(1 − r²),   t ~ t(n−2)

Note:

r√(n−2) / √(1 − r²) = β̂₁ s_x √(n−1) / s_y|x

which we used before.

e.g., in the age-SBP problem, r = .66 and

t = .66 √(30−2) / √(1 − (.66)²) = 4.62

which is the same value as that obtained in the test for the slope.
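A final sketch of this correlation test:

```python
import math

# Correlation test statistic t = r*sqrt(n-2)/sqrt(1-r^2).
n, r = 30, 0.66
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(t, 2))
# about 4.65 with r rounded to .66; with the unrounded r = 0.658
# this is about 4.62, matching the slope test above.
```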