
Cautions about Correlation and Regression | SHUBLEKA


• The square of the correlation, r², is the fraction of the variation in the y values that is explained by the
least-squares regression of y on x.
• Data transformation = applying a function such as the logarithm to the data, which can simplify statistical analysis.
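
A minimal sketch of both points, assuming made-up numbers (the data sets and the helper name `correlation` are illustrative, not part of these notes). It checks that r² agrees with 1 − SSE/SST for the least-squares line, and that taking logarithms straightens an exponential pattern.

```python
import math

def correlation(x, y):
    """Pearson correlation r between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Part 1: r^2 as the fraction of variation explained (hypothetical data).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx
sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))  # unexplained variation
sst = sum((b - my) ** 2 for b in y)                                   # total variation in y
r_squared = correlation(x, y) ** 2
fraction_explained = 1 - sse / sst
print(r_squared, fraction_explained)  # the two values agree up to rounding

# Part 2: a log transformation straightens exponential growth (hypothetical data).
t = [0, 1, 2, 3, 4, 5]
growth = [3 * 2 ** k for k in t]            # 3, 6, 12, 24, 48, 96
log_growth = [math.log(v) for v in growth]  # exactly linear in t
r_raw = correlation(t, growth)
r_log = correlation(t, log_growth)
print(r_raw, r_log)  # r_log is essentially 1; r_raw is noticeably smaller
```

The identity in Part 1 holds only for the least-squares line of y on x; for any other line, 1 − SSE/SST would be smaller than r².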

Residual = Observed − Predicted = y − ŷ

Geometrically: the vertical distance from each point to the least-squares regression line.

• Examining residuals helps assess how well the line describes the data.
• Special property: the mean of the least-squares residuals is always zero.
• Residual plot = scatterplot of the regression residuals against the explanatory variable.
• Use residual plots to assess the fit of a regression line.
• If the regression line captures the overall pattern of the data, there should be no pattern in the residuals.
• Look for striking individual points as well as for an overall pattern.
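
The points above can be sketched numerically. This is an illustration under stated assumptions: the data are invented (a deliberately curved pattern, y = x²) and `least_squares` is a hypothetical helper name. The residuals average to zero, yet they show a clear U-shaped pattern, which signals that a straight line is the wrong model.

```python
def least_squares(x, y):
    """Return (slope, intercept) of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical curved data: y = x^2.
x = [0, 1, 2, 3, 4]
y = [0, 1, 4, 9, 16]

slope, intercept = least_squares(x, y)  # fitted line: y-hat = -2 + 4x
# Residual = observed - predicted
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
mean_residual = sum(residuals) / len(residuals)

# Text version of a residual plot: residual against the explanatory variable.
for a, r in zip(x, residuals):
    print(a, r)
# residuals are 2, -1, -2, -1, 2: positive, negative, positive (a U shape)
print("mean residual:", mean_residual)
```

A zero mean residual says nothing about fit quality, since it holds for every least-squares line; the pattern in the residuals is what carries the information.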


• Outlier = a point that lies outside the overall pattern of the data.

• Outliers in the x-direction can have a strong influence on the position of the regression line.
• Outliers in the y-direction have large residuals.
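
A small sketch of a y-direction outlier, assuming invented data and a hypothetical helper `least_squares`. The outlier sits at the middle x value, so it produces a large residual but leaves the slope unchanged in this example (only the intercept shifts).

```python
def least_squares(x, y):
    """Return (slope, intercept) of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical data: points on y = x, except (3, 8), an outlier in y.
x = [1, 2, 3, 4, 5]
y = [1, 2, 8, 4, 5]

slope_all, intercept_all = least_squares(x, y)
residuals = [b - (intercept_all + slope_all * a) for a, b in zip(x, y)]

# Refit without the outlier to see how much the line moves.
x_clean = [a for a in x if a != 3]
y_clean = [b for a, b in zip(x, y) if a != 3]
slope_clean, intercept_clean = least_squares(x_clean, y_clean)

print("with outlier:   ", slope_all, intercept_all)      # slope 1, intercept 1
print("without outlier:", slope_clean, intercept_clean)  # slope 1, intercept 0
print("residuals:", residuals)  # the outlier's residual (4) dominates
```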

Influential points:

• A point is influential if removing it substantially changes the regression line. Outliers in the x-direction
are often influential points.
• Demonstration: Correlation and Regression Applet
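
For readers without the applet, here is a rough sketch of the same idea, assuming invented data and a hypothetical helper `least_squares`. A single point far out in the x-direction flattens the slope from 1 to about 0, yet its own residual is not the largest one, so a residual plot alone would not flag it.

```python
def least_squares(x, y):
    """Return (slope, intercept) of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical data: five points on y = x plus one x-outlier at (15, 2).
x = [1, 2, 3, 4, 5, 15]
y = [1, 2, 3, 4, 5, 2]

slope_with, intercept_with = least_squares(x, y)
slope_without, intercept_without = least_squares(x[:-1], y[:-1])

residuals = [b - (intercept_with + slope_with * a) for a, b in zip(x, y)]
outlier_residual = residuals[-1]
largest_residual = max(abs(r) for r in residuals)

print("slope with the point:   ", slope_with)     # essentially 0
print("slope without the point:", slope_without)  # 1
print("outlier residual:", outlier_residual, "largest |residual|:", largest_residual)
```

The x-outlier pulls the line toward itself, which is why its residual stays modest; comparing the fit with and without the point is the more reliable check.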


• Correlation measures only linear association, and fitting a straight line makes sense only when the
overall pattern is linear. Always plot the data before calculating.
• Extrapolation (predicting outside the range of the observed x values) often produces unreliable predictions.
• Correlation and least-squares regression are not resistant. Always plot the data and look for
potentially influential points.
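
Two of these cautions can be checked with a short sketch. The data and the helper name `fit_and_r` are assumptions made for illustration. First, a perfect but curved relationship (y = x² on symmetric x values) has correlation 0. Second, a line fitted to y = x² over x = 1..5 has a high correlation inside that range but badly underpredicts at x = 20.

```python
import math

def fit_and_r(x, y):
    """Return (slope, intercept, r) for the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / math.sqrt(sxx * syy)

# Caution 1: a strong nonlinear relationship can have r = 0 (hypothetical data).
x_sym = [-2, -1, 0, 1, 2]
y_sym = [a ** 2 for a in x_sym]
_, _, r_curved = fit_and_r(x_sym, y_sym)
print("r for y = x^2 on symmetric x:", r_curved)  # 0

# Caution 2: extrapolation (hypothetical data, observed range x = 1..5).
x_obs = [1, 2, 3, 4, 5]
y_obs = [a ** 2 for a in x_obs]
slope, intercept, r_in_range = fit_and_r(x_obs, y_obs)  # line: y-hat = -7 + 6x
x_new = 20                                # far outside the observed range
predicted = intercept + slope * x_new     # 113
actual = x_new ** 2                       # 400
print("r within range:", r_in_range)
print("predicted:", predicted, "actual:", actual)
```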

• Lurking variable = a variable that is not among the explanatory or response variables in a study and
yet may influence the interpretation of the relationships among those variables.
• Association does not imply causation.
• A correlation based on averages is usually higher than the correlation computed from data on individuals.
• A correlation based on data with a restricted range is often lower than it would be if we could
observe the full range of the variables.
• Demonstration: TI-83/84 residual plot (L3 = Y1(L1), L4 = L2 − L3), where L1 holds the x values, L2 the y values,
and Y1 the regression equation, so L3 holds the predicted values and L4 the residuals.
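
The cautions about averages and restricted range can be illustrated with one invented data set (the numbers and the helper name `correlation` are assumptions, chosen so the arithmetic is easy to follow). The same twelve individuals give three different correlations depending on how the data are summarized or subset.

```python
import math

def correlation(x, y):
    """Pearson correlation r between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical individuals: y = x plus scatter of +2 or -2.
x = list(range(1, 13))
y = [3, 0, 1, 6, 7, 4, 5, 10, 11, 8, 9, 14]

# Full range, individual data.
r_full = correlation(x, y)  # about 0.865

# Restricted range: keep only individuals with 5 <= x <= 8.
kept = [(a, b) for a, b in zip(x, y) if 5 <= a <= 8]
r_restricted = correlation([a for a, _ in kept], [b for _, b in kept])  # about 0.488

# Averages: replace each group of four consecutive individuals by its mean.
groups = [range(0, 4), range(4, 8), range(8, 12)]
x_means = [sum(x[i] for i in g) / 4 for g in groups]  # 2.5, 6.5, 10.5
y_means = [sum(y[i] for i in g) / 4 for g in groups]  # 2.5, 6.5, 10.5
r_averages = correlation(x_means, y_means)            # essentially 1

print(r_restricted, r_full, r_averages)
```

Averaging removes the individual scatter, which inflates r, while restricting the range shrinks the variation in x relative to the scatter, which deflates it.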