This course is about REGRESSION ANALYSIS:
• Constructing quantitative descriptions of the statistical association between y
(response variable) and x (predictor, or explanatory variable) from the sample data.
• Introducing models that let us interpret estimates and inferences on the parameters
of these descriptions in relation to the underlying population.
Least squares fit: find the line that minimizes the sum
of squared (vertical) distances from the sample points.
Normal equations:

$$\begin{cases} \displaystyle\sum_{i=1}^{n} y_i = n\beta_0 + \beta_1 \sum_{i=1}^{n} x_i \\[6pt] \displaystyle\sum_{i=1}^{n} x_i y_i = \beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2 \end{cases}$$

Solution:

$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$
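A minimal NumPy sketch of these closed-form estimates (the function and variable names are mine, for illustration):

```python
import numpy as np

def least_squares_line(x, y):
    """Closed-form least squares estimates (b0, b1) for a simple
    linear regression, i.e. the solution of the normal equations."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xbar, ybar = x.mean(), y.mean()
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    b0 = ybar - b1 * xbar   # fitted line passes through (xbar, ybar)
    return b0, b1
```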
$\hat{y}_i = b_0 + b_1 x_i$  (fitted value for sample point $i = 1, \dots, n$)
$e_i = y_i - \hat{y}_i$  (residual)
Geometric properties of the least squares line:

$$\sum_{i=1}^{n} e_i = 0 \qquad \sum_{i=1}^{n} e_i^2 = \min \qquad \sum_{i=1}^{n} \hat{y}_i = \sum_{i=1}^{n} y_i$$

$$\sum_{i=1}^{n} x_i e_i = 0 \qquad \sum_{i=1}^{n} \hat{y}_i e_i = 0$$

The fitted line $y = b_0 + b_1 x$ passes through the point $(\bar{x}, \bar{y})$.

[Figure: scatter plot of y vs. x with the fitted line and a residual e shown as a vertical distance from a sample point to the line.]
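These properties are easy to verify numerically; a sketch on simulated (made-up) data, using NumPy's `polyfit` for the fit:

```python
import numpy as np

# Hypothetical data: responses scattering about a line.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=50)

b1, b0 = np.polyfit(x, y, 1)   # least squares slope and intercept
y_hat = b0 + b1 * x            # fitted values
e = y - y_hat                  # residuals

assert np.isclose(e.sum(), 0)             # residuals sum to zero
assert np.isclose(y_hat.sum(), y.sum())   # fitted values preserve the total
assert np.isclose((x * e).sum(), 0)       # residuals "orthogonal" to x
assert np.isclose((y_hat * e).sum(), 0)   # ... and to the fitted values
```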
Simple linear regression MODEL:
The values of y (given various values of x) scatter about a line, with constant
variance and no correlation among the departures from the line of different
observations…
…quite simplistic, but very useful in many applications!
Note: the distribution of the errors is left unspecified for now.
If we assumed a bell-shaped distribution for the errors (which we will do later!),
here is what the “population” picture would look like:

$$E(y_i) = \beta_0 + \beta_1 x_i \qquad \mathrm{var}(y_i) = \sigma^2 \qquad \mathrm{cor}(y_i, y_j) = 0 \ \ (i \neq j)$$

Interpretation of parameters: the slope $\beta_1$ is the change in the mean response
per unit increase in x; the intercept $\beta_0$ is the mean response at x = 0.
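A short simulation sketch of this population picture (all parameter values here are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma = 1.0, 2.0, 1.5   # hypothetical population parameters

x = np.linspace(0, 10, 50)
# E(y_i) = beta0 + beta1 * x_i; departures from the line have constant
# variance sigma^2 and are drawn independently, so cor(y_i, y_j) = 0.
y = beta0 + beta1 * x + rng.normal(scale=sigma, size=x.size)
```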
Under the assumptions of our simple linear regression model, the slope and
intercept of the least squares line are (point) estimates of the population slope
and intercept, with the following very important properties:
• they are unbiased: $E(b_1) = \beta_1$, $E(b_0) = \beta_0$
• they are the most accurate (smallest MSE, i.e. variance) among all unbiased
estimates that can be computed as linear functions of the response values
(the Gauss–Markov theorem).
Linearity:

$$b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})\, y_i}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \sum_{i=1}^{n} k_i y_i, \qquad k_i = \frac{x_i - \bar{x}}{\sum_{j=1}^{n}(x_j - \bar{x})^2}$$

$$b_0 = \bar{y} - b_1 \bar{x} = \sum_{i=1}^{n} \frac{1}{n}\, y_i - \sum_{i=1}^{n} k_i \bar{x}\, y_i = \sum_{i=1}^{n} \left(\frac{1}{n} - k_i \bar{x}\right) y_i$$
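A sketch of this linearity on a small made-up dataset: both estimates are weighted sums of the responses, with weights that depend only on the x values:

```python
import numpy as np

# Hypothetical data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

k = (x - x.mean()) / np.sum((x - x.mean()) ** 2)  # slope weights k_i
b1 = np.sum(k * y)                                # b1 = sum_i k_i y_i
b0 = np.sum((1.0 / len(x) - k * x.mean()) * y)    # b0 as a weighted sum too

# Agrees with the direct least squares fit:
assert np.allclose(np.polyfit(x, y, 1), [b1, b0])
```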
Point estimation of error variance:

$$s^2 = \frac{1}{n-2} \sum_{i=1}^{n} \big(y_i - (b_0 + b_1 x_i)\big)^2 = \frac{1}{n-2} \sum_{i=1}^{n} e_i^2$$

The sum is the SSE (error sum of squares); $s^2$ is the MSE for the regression line.
The SSE has $n - 2$ degrees of freedom (two constraints, one for each estimated
parameter of the line).

Point estimation of the mean response at $x$:

$$\hat{E}(y \text{ at } x) = \hat{y}(x) = b_0 + b_1 x$$
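A minimal sketch of these estimates, continuing the made-up data above:

```python
import numpy as np

# Hypothetical data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
e = y - (b0 + b1 * x)     # residuals
sse = np.sum(e ** 2)      # error sum of squares
s2 = sse / (n - 2)        # MSE: two dof lost to estimating b0 and b1

def y_hat(x_new):
    """Estimated mean response at x_new: b0 + b1 * x_new."""
    return b0 + b1 * x_new

print(s2, y_hat(6.0))
```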
Example: Simple linear regression for y = log length ratio between human and
chicken DNA, on x = log large insertion ratio, as sampled on n = 100 genome windows.

Line parameters:
Intercept: 0.19210
Slope: 0.21777

Mean responses:
$\hat{y}(1) = 0.41$
$\hat{y}(2) = 0.63$ … would you trust this?
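The reported mean responses follow directly from the fitted line; plugging the slide's intercept and slope in:

```python
b0, b1 = 0.19210, 0.21777   # line parameters from the example above

print(round(b0 + b1 * 1, 2))   # 0.41
print(round(b0 + b1 * 2, 2))   # 0.63
```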
Example: Simple linear regression for y = mortality rate due to malignant skin
melanoma per 10 million people, on x = latitude, as sampled on 49 US states.

Data (excerpt):

      State      Latitude   Mortality
 5    Colorado   39.0       149
 ...
 49   Wyoming    43.0       134

Line parameters:
Intercept: 389.189
Slope: -5.977

[Figure: regression plot, Mortality = 389.189 − 5.97764 Latitude.]
Maximum likelihood estimation under normality

Likelihood function:

$$L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2} \big(y_i - (\beta_0 + \beta_1 x_i)\big)^2 \right\} = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \big(y_i - (\beta_0 + \beta_1 x_i)\big)^2 \right\}$$
Objective function:

$$\max_{\beta_0, \beta_1, \sigma^2} L(\beta_0, \beta_1, \sigma^2)$$

Solution (unique):

$$b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$

same as from the least squares fit!

$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - (b_0 + b_1 x_i)\big)^2 = \frac{n-2}{n}\, s^2$$

(the maximum likelihood estimate of the error variance divides by $n$, not $n - 2$,
so it is biased).
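A numerical sketch: maximizing the normal log-likelihood with a generic optimizer recovers the closed-form solution (data and starting values are made up):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)

def neg_log_lik(theta):
    b0, b1, log_s2 = theta          # sigma^2 on the log scale, to keep it positive
    s2 = np.exp(log_s2)
    resid = y - (b0 + b1 * x)
    return 0.5 * n * np.log(2 * np.pi * s2) + np.sum(resid ** 2) / (2 * s2)

res = minimize(neg_log_lik, x0=[0.0, 0.0, 0.0])
b0_ml, b1_ml = res.x[0], res.x[1]
s2_ml = np.exp(res.x[2])            # biased: equals (n - 2)/n * s^2

# Same slope and intercept as the least squares fit:
b1_ls, b0_ls = np.polyfit(x, y, 1)
assert np.allclose([b0_ml, b1_ml], [b0_ls, b1_ls], atol=1e-3)
```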
Some remarks