
ESTIMATING A REGRESSION LINE

F. Chiaromonte
This course is about REGRESSION ANALYSIS:
• Constructing quantitative descriptions of the statistical association between y (response variable) and x (predictor, or explanatory variable) based on the sample data.
• Introducing models, so as to interpret estimates and inferences on the parameters of these descriptions in relation to the underlying population.

We speak of MULTIPLE regression when we consider more than one predictor variable.


Fitting a line through bivariate sample data (as a descriptor)

Least squares fit: find the line that minimizes the sum
of squared (vertical) distances from the sample points.

y = \beta_0 + \beta_1 x \qquad (generic equation of a line)

\min_{\beta_0, \beta_1} \left\{ \sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_i))^2 \right\} \qquad (objective function)

Normal equations (from the derivatives of the objective function):

\sum_{i=1}^n y_i = n\beta_0 + \beta_1 \sum_{i=1}^n x_i

\sum_{i=1}^n x_i y_i = \beta_0 \sum_{i=1}^n x_i + \beta_1 \sum_{i=1}^n x_i^2

Solution (unique):

b_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}
\hat{y}_i = b_0 + b_1 x_i \qquad fitted value for sample point i = 1, \dots, n

e_i = y_i - \hat{y}_i \qquad residual
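As a quick numerical illustration, here is a minimal sketch of the closed-form least squares computation in Python (the data are made up for demonstration):

```python
import numpy as np

# Made-up illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form least squares estimates (solution of the normal equations)
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

y_hat = b0 + b1 * x   # fitted values
e = y - y_hat         # residuals
```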
Geometric properties of the least squares line:

\sum_{i=1}^n e_i = 0

\sum_{i=1}^n e_i^2 = \min

\sum_{i=1}^n \hat{y}_i = \sum_{i=1}^n y_i

\sum_{i=1}^n x_i e_i = 0

\sum_{i=1}^n \hat{y}_i e_i = 0

and the line \hat{y} = b_0 + b_1 x passes through the point (\bar{x}, \bar{y}).

[Figure: scatter plot of the sample points with the fitted line \hat{y} = b_0 + b_1 x passing through (\bar{x}, \bar{y}).]
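These identities are easy to verify numerically, continuing from the sketch above (all checks should print True, up to floating-point error):

```python
# Continuing from the previous sketch (x, y, b0, b1, y_hat, e defined there)
print(np.isclose(e.sum(), 0.0))                    # residuals sum to zero
print(np.isclose(np.sum(x * e), 0.0))              # residuals orthogonal to x
print(np.isclose(np.sum(y_hat * e), 0.0))          # residuals orthogonal to fitted values
print(np.isclose(y_hat.sum(), y.sum()))            # fitted values preserve the total
print(np.isclose(b0 + b1 * x.mean(), y.mean()))    # line passes through (x-bar, y-bar)
```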
Simple linear regression MODEL:

Assume the sample data are generated as y_i = \beta_0 + \beta_1 x_i + \varepsilon_i

x_i, i = 1, \dots, n \qquad fixed (or conditioned on)

\varepsilon_i, i = 1, \dots, n \qquad random errors such that:

E(\varepsilon_i) = 0 \ \forall i \qquad (no systematic component)
\mathrm{var}(\varepsilon_i) = \sigma^2 \ \forall i \qquad (constant variance)
\mathrm{cor}(\varepsilon_i, \varepsilon_j) = 0 \ \forall i \neq j \qquad (uncorrelated)

The values of y (given various values of x) scatter about a line, with constant variance and no correlation among the departures from the line carried by different observations…
…quite simplistic, but very useful in many applications!
Note: the distribution of the errors is unspecified for now.
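A short simulation makes these assumptions concrete (a sketch; the parameter values are arbitrary, and normal errors are used here purely for convenience, since the model itself leaves the error distribution unspecified):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary "true" parameters, chosen for illustration only
beta0, beta1, sigma = 1.0, 0.5, 0.3

x = np.linspace(0.0, 10.0, 50)              # fixed design points
eps = rng.normal(0.0, sigma, size=x.size)   # mean 0, constant variance, uncorrelated
y = beta0 + beta1 * x + eps                 # y_i = beta0 + beta1 * x_i + eps_i
```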

If we assumed a bell-shaped distribution for the errors (which we will do later!), here is what the "population" picture would look like:

E(y_i) = \beta_0 + \beta_1 x_i
\mathrm{var}(y_i) = \sigma^2
\mathrm{cor}(y_i, y_j) = 0, \ i \neq j

Interpretation of parameters:

β1 = slope: change in the mean of y when x changes by one unit.
β0 = intercept: if x = 0 is a meaningful value, the mean of y when x = 0.
σ2 = error variance: scatter of y about the regression line.

Under the assumptions of our simple linear regression model, the slope and intercept of the least squares line are (point) estimates of the population slope and intercept, with the following very important properties:

GAUSS-MARKOV THEOREM: Under the conditions of the simple linear regression model:
• b_1 and b_0 are unbiased estimates of \beta_1 and \beta_0:

E(b_1) = \beta_1, \qquad E(b_0) = \beta_0

• they are the most accurate (smallest MSE, i.e. variance) among all unbiased estimates that can be computed as linear functions of the response values.
Linearity:

b_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{\sum_{i=1}^n (x_i - \bar{x}) y_i}{\sum_{i=1}^n (x_i - \bar{x})^2} = \sum_{i=1}^n k_i y_i, \qquad k_i = \frac{x_i - \bar{x}}{\sum_{j=1}^n (x_j - \bar{x})^2}

(the second equality holds because \sum_{i=1}^n (x_i - \bar{x}) \bar{y} = 0)

b_0 = \bar{y} - b_1 \bar{x} = \sum_{i=1}^n \left( \frac{1}{n} - k_i \bar{x} \right) y_i = \sum_{i=1}^n k_i' y_i
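A quick numeric check of the linearity claim for b_1 (a sketch on made-up data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
k = (x - x_bar) / sxx                                   # weights k_i depend only on the x_i
b1_direct = np.sum((x - x_bar) * (y - y.mean())) / sxx  # usual formula
b1_linear = np.sum(k * y)                               # b1 as a linear function of the y_i
print(np.isclose(b1_direct, b1_linear))                 # True
```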
Point estimation of error variance

s^2 = \frac{1}{n-2} \sum_{i=1}^n (y_i - (b_0 + b_1 x_i))^2 = \frac{1}{n-2} \sum_{i=1}^n e_i^2

The sum being divided is the SSE (error sum of squares); s^2 is the MSE for the regression line. The degrees of freedom of SSE are n - 2 (constraints = the two parameters of the line).

Unbiased for \sigma^2: E(s^2) = \sigma^2

Point estimation of mean response

\hat{E}(y \text{ at } x) = \hat{y}(x) = b_0 + b_1 x
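In code, these estimates look as follows (a self-contained sketch on made-up data; note the divisor n - 2, not n):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar
e = y - (b0 + b1 * x)

n = len(y)
s2 = np.sum(e ** 2) / (n - 2)   # SSE / (n - 2): unbiased estimate of sigma^2

def y_hat(x_new):
    """Point estimate of the mean response at x_new."""
    return b0 + b1 * x_new
```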

Example: Simple linear regression for y = log length ratio between human and chicken DNA, on x = log large insertion ratio, as sampled on n = 100 genome windows.

Estimates from least squares

Line parameters:
  Intercept: 0.19210
  Slope: 0.21777

Error variance: 0.033 on 98 dof's

Mean responses:
\hat{y}(1) = 0.41
\hat{y}(2) = 0.63 … would you trust this?
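These mean responses follow directly from plugging into the fitted line:

\hat{y}(1) = 0.19210 + 0.21777 \cdot 1 \approx 0.41, \qquad \hat{y}(2) = 0.19210 + 0.21777 \cdot 2 \approx 0.63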

Example: Simple linear regression for y = mortality rate due to malignant skin melanoma per 10 million people, on x = latitude, as sampled on 49 US states.

 #  State       LAT   MORT
 1  Alabama     33.0  219
 2  Arizona     34.5  160
 3  Arkansas    35.0  170
 4  California  37.5  182
 5  Colorado    39.0  149
 ...
 49 Wyoming     43.0  134

[Figure: regression plot of Mortality vs. Latitude, with fitted line Mortality = 389.189 - 5.97764 Latitude; S = 19.1150, R-Sq = 68.0%, R-Sq(adj) = 67.3%.]

Estimates from least squares

Line parameters:
  Intercept: 389.19
  Slope: -5.977

Error variance: 365.57 on 47 dof's

Mean responses:
\hat{y}(30) = 209.88 … would you trust this?
\hat{y}(40) = 150.11
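Again, these follow from the fitted line:

\hat{y}(30) = 389.19 - 5.977 \cdot 30 = 209.88, \qquad \hat{y}(40) = 389.19 - 5.977 \cdot 40 = 150.11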

Maximum likelihood estimation under normality

x_i, i = 1, \dots, n \qquad fixed (or conditioned on)

Assume (error distribution):

\varepsilon_i, i = 1, \dots, n \ \text{iid} \ N(0, \sigma^2)

Then

y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2) and independent, i = 1, \dots, n

Likelihood function:

L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2} (y_i - (\beta_0 + \beta_1 x_i))^2 \right\} = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_i))^2 \right\}
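As a sanity check, the likelihood can also be maximized numerically and compared with the closed-form least squares estimates (a sketch using scipy.optimize on made-up data):

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def neg_log_lik(theta):
    # Parameterize by log(sigma) to keep sigma > 0 during optimization
    b0, b1, log_sigma = theta
    sigma2 = np.exp(2.0 * log_sigma)
    r = y - (b0 + b1 * x)
    n = len(y)
    return 0.5 * n * np.log(2.0 * np.pi * sigma2) + np.sum(r ** 2) / (2.0 * sigma2)

res = minimize(neg_log_lik, x0=np.zeros(3))
b0_ml, b1_ml, log_sigma_ml = res.x
# b0_ml and b1_ml agree with the least squares estimates;
# the ML variance estimate exp(2 * log_sigma_ml) equals SSE / n, not SSE / (n - 2)
```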

\max_{\beta_0, \beta_1, \sigma^2} L(\beta_0, \beta_1, \sigma^2) \qquad (objective function)

Solution (unique):

b_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x} \qquad same as from the least squares fit!

\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (y_i - (b_0 + b_1 x_i))^2 = \frac{n-2}{n} s^2

so the ML estimate of the error variance divides by n rather than n - 2, and is slightly biased downward.

Some remarks

• A strong statistical association between y and x does not automatically imply causation (e.g. a functional relationship of linear form); x can "proxy" for the real causal variable, perhaps in a spurious fashion!

• In observational studies, x (and likewise y) is not controlled by the researchers; we "condition" on the observed values of x. In experimental studies, x is controlled; we can consider the values of x as fixed (although the assignment of x levels to units may be arranged at random in an experimental design). Experimental design facilitates causality assessment.

• Extrapolating a statistical association (e.g. a regression line) outside the range of the data on which it was "fitted" is dangerous; we don't really know how the association would be shaped where we didn't get to look! With an experimental design we can make sure we cover the range that is of interest.

