
ESTIMATING A REGRESSION LINE

F. Chiaromonte
This course is about REGRESSION ANALYSIS:
• Constructing quantitative descriptions of the statistical association between y (response variable) and x (predictor, or explanatory variable) based on the sample data.
• Introducing models, so as to interpret estimates and inferences on the parameters of these descriptions in relation to the underlying population.

We speak of MULTIPLE regression when we consider more than one predictor variable.


Fitting a line through bivariate sample data (as a descriptor)

Least squares fit: find the line that minimizes the sum
of squared (vertical) distances from the sample points.

y = \beta_0 + \beta_1 x \qquad (generic equation of a line)

\min_{\beta_0, \beta_1} \left\{ \sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_i))^2 \right\} \qquad (objective function)

Normal equations (from the derivatives of the objective function):

\sum_{i=1}^n y_i = n\beta_0 + \beta_1 \sum_{i=1}^n x_i

\sum_{i=1}^n x_i y_i = \beta_0 \sum_{i=1}^n x_i + \beta_1 \sum_{i=1}^n x_i^2

Solution (unique):

b_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}
\hat{y}_i = b_0 + b_1 x_i \qquad fitted value for sample point i = 1, \dots, n

e_i = y_i - \hat{y}_i \qquad residual
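As a quick numerical illustration, here is a minimal sketch of the closed-form least squares computation in Python (the data are made up for demonstration):

```python
import numpy as np

# Made-up illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form least squares estimates (solution of the normal equations)
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

y_hat = b0 + b1 * x   # fitted values
e = y - y_hat         # residuals
```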
Geometric properties of the least squares line:

\sum_{i=1}^n e_i = 0

\sum_{i=1}^n e_i^2 = \min

\sum_{i=1}^n \hat{y}_i = \sum_{i=1}^n y_i

\sum_{i=1}^n x_i e_i = 0

\sum_{i=1}^n \hat{y}_i e_i = 0

and the line \hat{y} = b_0 + b_1 x passes through the point (\bar{x}, \bar{y}).

[Figure: scatter plot of the sample points with the fitted line \hat{y} = b_0 + b_1 x passing through (\bar{x}, \bar{y}).]
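These identities are easy to verify numerically, continuing from the sketch above (all checks should print True, up to floating-point error):

```python
# Continuing from the previous sketch (x, y, b0, b1, y_hat, e defined there)
print(np.isclose(e.sum(), 0.0))                    # residuals sum to zero
print(np.isclose(np.sum(x * e), 0.0))              # residuals orthogonal to x
print(np.isclose(np.sum(y_hat * e), 0.0))          # residuals orthogonal to fitted values
print(np.isclose(y_hat.sum(), y.sum()))            # fitted values preserve the total
print(np.isclose(b0 + b1 * x.mean(), y.mean()))    # line passes through (x-bar, y-bar)
```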
Simple linear regression MODEL:

Assume the sample data are generated as y_i = \beta_0 + \beta_1 x_i + \varepsilon_i

x_i, i = 1, \dots, n \qquad fixed (or conditioned on)

\varepsilon_i, i = 1, \dots, n \qquad random errors such that:

E(\varepsilon_i) = 0 \ \forall i \qquad (no systematic component)
\mathrm{var}(\varepsilon_i) = \sigma^2 \ \forall i \qquad (constant variance)
\mathrm{cor}(\varepsilon_i, \varepsilon_j) = 0 \ \forall i \neq j \qquad (uncorrelated)

The values of y (given various values of x) scatter about a line, with constant variance and no correlation among the departures from the line carried by different observations…
…quite simplistic, but very useful in many applications!
Note: the distribution of the errors is unspecified for now.
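A short simulation makes these assumptions concrete (a sketch; the parameter values are arbitrary, and normal errors are used here purely for convenience, since the model itself leaves the error distribution unspecified):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary "true" parameters, chosen for illustration only
beta0, beta1, sigma = 1.0, 0.5, 0.3

x = np.linspace(0.0, 10.0, 50)              # fixed design points
eps = rng.normal(0.0, sigma, size=x.size)   # mean 0, constant variance, uncorrelated
y = beta0 + beta1 * x + eps                 # y_i = beta0 + beta1 * x_i + eps_i
```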

If we assumed a bell-shaped distribution for the errors (which we will do later!), here is what the "population" picture would look like:

E(y_i) = \beta_0 + \beta_1 x_i
\mathrm{var}(y_i) = \sigma^2
\mathrm{cor}(y_i, y_j) = 0, \ i \neq j

Interpretation of parameters:

β1 = slope: change in the mean of y when x changes by one unit.
β0 = intercept: if x = 0 is a meaningful value, the mean of y when x = 0.
σ2 = error variance: scatter of y about the regression line.

Under the assumptions of our simple linear regression model, the slope and intercept of the least squares line are (point) estimates of the population slope and intercept, with the following very important properties:

GAUSS-MARKOV THEOREM: Under the conditions of the simple linear regression model:
• b_1 and b_0 are unbiased estimates of \beta_1 and \beta_0:

E(b_1) = \beta_1, \qquad E(b_0) = \beta_0

• they are the most accurate (smallest MSE, i.e. variance) among all unbiased estimates that can be computed as linear functions of the response values.
Linearity:

b_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{\sum_{i=1}^n (x_i - \bar{x}) y_i}{\sum_{i=1}^n (x_i - \bar{x})^2} = \sum_{i=1}^n k_i y_i, \qquad k_i = \frac{x_i - \bar{x}}{\sum_{j=1}^n (x_j - \bar{x})^2}

(the second equality holds because \sum_{i=1}^n (x_i - \bar{x}) \bar{y} = 0)

b_0 = \bar{y} - b_1 \bar{x} = \sum_{i=1}^n \left( \frac{1}{n} - k_i \bar{x} \right) y_i = \sum_{i=1}^n k_i' y_i
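A quick numeric check of the linearity claim for b_1 (a sketch on made-up data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar = x.mean()
sxx = np.sum((x - x_bar) ** 2)
k = (x - x_bar) / sxx                                   # weights k_i depend only on the x_i
b1_direct = np.sum((x - x_bar) * (y - y.mean())) / sxx  # usual formula
b1_linear = np.sum(k * y)                               # b1 as a linear function of the y_i
print(np.isclose(b1_direct, b1_linear))                 # True
```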
Point estimation of error variance

s^2 = \frac{1}{n-2} \sum_{i=1}^n (y_i - (b_0 + b_1 x_i))^2 = \frac{1}{n-2} \sum_{i=1}^n e_i^2

The sum being divided is the SSE (error sum of squares); s^2 is the MSE for the regression line. The degrees of freedom of SSE are n - 2 (constraints = the two parameters of the line).

Unbiased for \sigma^2: E(s^2) = \sigma^2

Point estimation of mean response

\hat{E}(y \text{ at } x) = \hat{y}(x) = b_0 + b_1 x
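In code, these estimates look as follows (a self-contained sketch on made-up data; note the divisor n - 2, not n):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar
e = y - (b0 + b1 * x)

n = len(y)
s2 = np.sum(e ** 2) / (n - 2)   # SSE / (n - 2): unbiased estimate of sigma^2

def y_hat(x_new):
    """Point estimate of the mean response at x_new."""
    return b0 + b1 * x_new
```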

Example: Simple linear regression for y = log length ratio between human and chicken DNA, on x = log large insertion ratio, as sampled on n = 100 genome windows.

Estimates from least squares

Line parameters:
  Intercept: 0.19210
  Slope: 0.21777

Error variance: 0.033 on 98 dof's

Mean responses:
\hat{y}(1) = 0.41
\hat{y}(2) = 0.63 … would you trust this?
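These mean responses follow directly from plugging into the fitted line:

\hat{y}(1) = 0.19210 + 0.21777 \cdot 1 \approx 0.41, \qquad \hat{y}(2) = 0.19210 + 0.21777 \cdot 2 \approx 0.63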

Example: Simple linear regression for y = mortality rate due to malignant skin melanoma per 10 million people, on x = latitude, as sampled on 49 US states.

 #  State       LAT   MORT
 1  Alabama     33.0  219
 2  Arizona     34.5  160
 3  Arkansas    35.0  170
 4  California  37.5  182
 5  Colorado    39.0  149
 ...
 49 Wyoming     43.0  134

[Figure: regression plot of Mortality vs. Latitude, with fitted line Mortality = 389.189 - 5.97764 Latitude; S = 19.1150, R-Sq = 68.0%, R-Sq(adj) = 67.3%.]

Estimates from least squares

Line parameters:
  Intercept: 389.19
  Slope: -5.977

Error variance: 365.57 on 47 dof's

Mean responses:
\hat{y}(30) = 209.88 … would you trust this?
\hat{y}(40) = 150.11
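Again, these follow from the fitted line:

\hat{y}(30) = 389.19 - 5.977 \cdot 30 = 209.88, \qquad \hat{y}(40) = 389.19 - 5.977 \cdot 40 = 150.11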

Maximum likelihood estimation under normality

x_i, i = 1, \dots, n \qquad fixed (or conditioned on)

Assume (error distribution):

\varepsilon_i, i = 1, \dots, n \ \text{iid} \ N(0, \sigma^2)

Then

y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2) and independent, i = 1, \dots, n

Likelihood function:

L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2} (y_i - (\beta_0 + \beta_1 x_i))^2 \right\} = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_i))^2 \right\}
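As a sanity check, the likelihood can also be maximized numerically and compared with the closed-form least squares estimates (a sketch using scipy.optimize on made-up data):

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def neg_log_lik(theta):
    # Parameterize by log(sigma) to keep sigma > 0 during optimization
    b0, b1, log_sigma = theta
    sigma2 = np.exp(2.0 * log_sigma)
    r = y - (b0 + b1 * x)
    n = len(y)
    return 0.5 * n * np.log(2.0 * np.pi * sigma2) + np.sum(r ** 2) / (2.0 * sigma2)

res = minimize(neg_log_lik, x0=np.zeros(3))
b0_ml, b1_ml, log_sigma_ml = res.x
# b0_ml and b1_ml agree with the least squares estimates;
# the ML variance estimate exp(2 * log_sigma_ml) equals SSE / n, not SSE / (n - 2)
```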

\max_{\beta_0, \beta_1, \sigma^2} L(\beta_0, \beta_1, \sigma^2) \qquad (objective function)

Solution (unique):

b_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x} \qquad same as from the least squares fit!

\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (y_i - (b_0 + b_1 x_i))^2 = \frac{n-2}{n} s^2

so the ML estimate of the error variance divides by n rather than n - 2, and is slightly biased downward.

Some remarks

• A strong statistical association between y and x does not automatically imply causation (e.g. a functional relationship of linear form); x can "proxy" for the real causal variable, perhaps in a spurious fashion!

• In observational studies, x (and likewise y) is not controlled by the researchers; we "condition" on the observed values of x. In experimental studies, x is controlled; we can consider the values of x as fixed (although the assignment of x levels to units may be arranged at random in an experimental design). Experimental design facilitates causality assessment.

• Extrapolating a statistical association (e.g. a regression line) outside the range of the data on which it was "fitted" is dangerous; we don't really know how the association would be shaped where we didn't get to look! With an experimental design we can make sure we cover the range that is of interest.

