
Regression diagnostics

Diagnostics
The theoretical assumptions for the simple linear regression model with one response variable Y and one explanatory variable x are:

Linearity: y_i = β0 + β1 x_i + u_i, for i = 1, ..., n
Homogeneity: E[u_i] = 0, for i = 1, ..., n
Homoscedasticity: Var[u_i] = σ², for i = 1, ..., n
Independence: u_i and u_j are independent for i ≠ j
Normality: u_i ~ N(0, σ²), for i = 1, ..., n

We study how to apply diagnostic procedures to test whether these assumptions are appropriate for the available data (x_i, y_i). These procedures are based on the analysis of the residuals e_i = y_i − ŷ_i.

Diagnostics: scatterplots

Scatterplots
The simplest diagnostic procedure is based on visual examination of the scatterplot of (x_i, y_i). This simple but powerful method often reveals patterns suggesting whether the theoretical model is appropriate or not. We illustrate its application on a classical example: consider the four following datasets.

Diagnostics: scatterplots
The Anscombe datasets

TWO-VARIABLE LINEAR REGRESSION

TABLE 3-10 Four Data Sets

Data Set 1       Data Set 2       Data Set 3       Data Set 4
  X      Y         X      Y         X      Y         X      Y
 10.0   8.04      10.0   9.14      10.0   7.46       8.0   6.58
  8.0   6.95       8.0   8.14       8.0   6.77       8.0   5.76
 13.0   7.58      13.0   8.74      13.0  12.74       8.0   7.71
  9.0   8.81       9.0   8.77       9.0   7.11       8.0   8.84
 11.0   8.33      11.0   9.26      11.0   7.81       8.0   8.47
 14.0   9.96      14.0   8.10      14.0   8.84       8.0   7.04
  6.0   7.24       6.0   6.13       6.0   6.08       8.0   5.25
  4.0   4.26       4.0   3.10       4.0   5.39      19.0  12.50
 12.0  10.84      12.0   9.13      12.0   8.15       8.0   5.56
  7.0   4.82       7.0   7.26       7.0   6.42       8.0   7.91
  5.0   5.68       5.0   4.74       5.0   5.73       8.0   6.89

SOURCE: F. J. Anscombe, op. cit.

Mean of X = 9.0, mean of Y = 7.5, for all four data sets.

Diagnostics: scatterplots

The Anscombe example
The estimated regression model is the same for all four datasets:

ŷ_i = 3.0 + 0.5 x_i,   with n = 11, x̄ = 9.0, ȳ = 7.5, r_xy = 0.817

The estimated standard error of the estimator β̂1,

√( s_R² / ((n − 1) s_x²) ),

takes the value 0.118 in all four cases, and the corresponding T statistic takes the value T = 0.5/0.118 = 4.237. But the corresponding scatterplots show that the four datasets are quite different. Which conclusions could we reach from these diagrams?
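These identical summary statistics can be verified numerically. The following sketch (assuming NumPy is available) fits the least-squares line to Anscombe's first data set and recovers the slope, intercept, correlation, and the standard error of β̂1 quoted above:

```python
import numpy as np

# Anscombe's data set 1 from Table 3-10
x = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

n = len(x)
b1, b0 = np.polyfit(x, y, 1)          # least-squares slope and intercept
r = np.corrcoef(x, y)[0, 1]           # sample correlation r_xy

e = y - (b0 + b1 * x)                 # residuals
sR2 = np.sum(e**2) / (n - 2)          # residual variance s_R^2
se_b1 = np.sqrt(sR2 / ((n - 1) * np.var(x, ddof=1)))  # s.e. of the slope
t = b1 / se_b1                        # T statistic

print(round(b1, 3), round(b0, 3), round(r, 3), round(se_b1, 3), round(t, 2))
```

Running the same code on any of the other three data sets reproduces essentially the same numbers, which is precisely Anscombe's point.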

Diagnostics: scatterplots
Anscombe data scatterplots

[FIGURE 3-29: Scatterplots for the four data sets of Table 3-10 (Data Set 1 through Data Set 4). SOURCE: F. J. Anscombe, op. cit.]

"(1) numerical calculations are exact, but graphs are rough" — one of the beliefs Anscombe set out to dispel.

Residual analysis
Further analysis of the residuals
If observation of the scatterplot is not sufficient to reject the model, a further step is to use diagnostic methods based on the analysis of the residuals e_i = y_i − ŷ_i. This analysis starts by standardizing the residuals, that is, dividing them by the residual (quasi-)standard deviation s_R. The resulting quantities, e_i / s_R, are known as standardized residuals. Under the assumptions of the linear regression model, the standardized residuals are approximately independent standard normal random variables, so a plot of the standardized residuals should show no clear pattern.
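As a minimal sketch (NumPy assumed), standardized residuals can be computed directly from the fitted line; with Anscombe's first data set they all stay within ±3, as expected under the model:

```python
import numpy as np

# Anscombe's data set 1 (Table 3-10)
x = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

b1, b0 = np.polyfit(x, y, 1)
e = y - (b0 + b1 * x)                      # raw residuals e_i = y_i - yhat_i
sR = np.sqrt(np.sum(e**2) / (len(x) - 2))  # residual (quasi-)standard deviation
std_resid = e / sR                         # standardized residuals
```

Plotting `std_resid` against `x` (or against the fitted values) is then the basis for the residual plots discussed next.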

Residual analysis

Residual plots
Several types of residual plots can be constructed. The most common ones are:

Plot of standardized residuals vs. x
Plot of standardized residuals vs. ŷ (the predicted responses)

Deviations from the model hypotheses result in patterns in these plots, which should be visually recognizable.

Residual plots examples


Example: consistency with the theoretical model

Residual plots examples


Example: nonlinearity

Residual plots examples


Example: heteroscedasticity

Residual analysis

Outliers
In a plot of the regression line we may observe outliers, that is, data points that show significant deviations from the regression line (or from the remaining data). The parameter estimators for the regression model, β̂0 and β̂1, are very sensitive to these outliers. It is important to identify the outliers and make sure that they really are valid data. Statgraphics can show, for example, the data that generate "Unusual residuals" or "Influential points".
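A common rule of thumb (not specific to Statgraphics) flags points whose standardized residual exceeds 2 in absolute value. Applied to Anscombe's third data set, the sketch below isolates the single point that pulls the line away from the other ten:

```python
import numpy as np

# Anscombe's data set 3 (Table 3-10): ten nearly collinear points plus one outlier
x = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
y = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])

b1, b0 = np.polyfit(x, y, 1)
e = y - (b0 + b1 * x)
sR = np.sqrt(np.sum(e**2) / (len(x) - 2))
std_resid = e / sR

outliers = np.where(np.abs(std_resid) > 2)[0]   # rule-of-thumb threshold |e_i/s_R| > 2
print(outliers, x[outliers], y[outliers])       # flags the point (13.0, 12.74)
```

Refitting without the flagged point would change β̂0 and β̂1 noticeably, which illustrates the sensitivity mentioned above.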

Residual analysis

Normality of the errors


One of the theoretical assumptions of the linear regression model is that the errors follow a normal distribution. This assumption can be checked visually from the analysis of the residuals e_i, using different approaches:

By inspection of the frequency histogram of the residuals
By inspection of the normal probability plot of the residuals (significant deviations of the data from the straight line in the plot correspond to significant departures from the normality assumption)
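A normal probability plot amounts to comparing the ordered residuals against normal quantiles; `scipy.stats.probplot` returns the plot coordinates together with the correlation of the points with the fitted straight line. A minimal sketch, where the synthetic "residuals" are an illustrative stand-in for residuals from a real fit:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
resid = rng.normal(loc=0.0, scale=1.5, size=200)   # stand-in for regression residuals

# osm: theoretical normal quantiles; osr: ordered sample values;
# r measures how close the points lie to a straight line
(osm, osr), (slope, intercept, r) = stats.probplot(resid, dist="norm")
print(round(r, 3))   # r near 1: no visual evidence against normality
```

With genuinely non-normal residuals (for example, strongly skewed ones), `r` drops and the plotted points bend away from the line.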

Nonlinear relationships and linearizing transformations


Introduction
Consider the case when the deterministic part f(x_i; a, b) of the response in the model

y_i = f(x_i; a, b) + u_i,   i = 1, ..., n

is a nonlinear function of x that depends on two parameters a and b (for example, f(x; a, b) = a b^x). In some cases we may apply transformations to the data to linearize them; we are then able to apply the linear regression procedure. From the original data (x_i, y_i) we obtain the transformed data (x′_i, y′_i). The parameters β0 and β1 corresponding to the linear relation between x′_i and y′_i are transformations of the parameters a and b.

Nonlinear relationships and linearizing transformations

Linearizing transformations
Examples of linearizing transformations:

If y = f(x; a, b) = a x^b, then log y = log a + b log x. We have
y′ = log y, x′ = log x, β0 = log a, β1 = b

If y = f(x; a, b) = a b^x, then log y = log a + (log b) x. We have
y′ = log y, x′ = x, β0 = log a, β1 = log b

If y = f(x; a, b) = 1/(a + bx), then 1/y = a + bx. We have
y′ = 1/y, x′ = x, β0 = a, β1 = b

If y = f(x; a, b) = log(a x^b), then y = log a + b(log x). We have
y′ = y, x′ = log x, β0 = log a, β1 = b
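The second transformation above (y = a b^x) can be checked end to end. The sketch below (NumPy assumed; the true values of a and b and the noise level are illustrative choices) simulates data with multiplicative errors, fits the linear model to (x, log y), and back-transforms β̂0 and β̂1 into estimates of a and b:

```python
import numpy as np

rng = np.random.default_rng(1)
a_true, b_true = 2.0, 1.3                      # illustrative parameter values
x = np.linspace(0.0, 10.0, 50)
# Multiplicative noise so that log y satisfies the additive linear model
y = a_true * b_true**x * np.exp(rng.normal(0.0, 0.05, size=x.size))

# Linearize: log y = log a + (log b) x, i.e. y' = beta0 + beta1 * x'
beta1, beta0 = np.polyfit(x, np.log(y), 1)
a_hat, b_hat = np.exp(beta0), np.exp(beta1)    # back-transform the parameters
print(a_hat, b_hat)
```

Note that least squares on the transformed data assumes the errors are multiplicative on the original scale; if they are additive, the transformation distorts the error structure and the fit should be interpreted with care.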
