

Autocorrelation is a characteristic of data in which the values of the same variable are
correlated across related observations. It violates the assumption of instance
independence, which underlies most conventional models. It generally exists in those
types of data sets in which the data, instead of being randomly selected, come from the
same source.
The presence of autocorrelation is generally unexpected by the researcher. It occurs mostly
due to dependencies within the data. Its presence is a strong motivation for those
researchers who are interested in relational learning and inference.
To understand autocorrelation, consider some examples based on cross-sectional and
time series data. In cross-sectional data, if a change in the income of person A affects
the savings of person B (a person other than person A), then autocorrelation is present.
In time series data, if the observations are correlated with one another, particularly
when the time intervals between them are small, this intercorrelation is called
autocorrelation.
In time series data, autocorrelation is the correlation of a series with a delayed (lagged)
copy of itself, where the delay is some specific number of time units. Serial correlation,
on the other hand, is used here for the lagged correlation between two different series in
time series data. Note that many texts call that quantity cross-correlation and treat
serial correlation as a synonym for autocorrelation.
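The idea of correlating a series with a lagged copy of itself can be made concrete with a minimal sketch. It assumes the common sample estimator (lagged cross-products divided by the total sum of squared deviations). The function name and both example series are made up for illustration.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive lag (illustrative helper)."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    # Total variation of the series around its mean (the denominator).
    denom = sum((x - mean) ** 2 for x in series)
    # Co-variation between each value and the value `lag` steps earlier.
    numer = sum((series[t] - mean) * (series[t - lag] - mean)
                for t in range(lag, n))
    return numer / denom

# Made-up series with a steady trend: neighbouring values are similar.
trend = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Made-up series that flips sign at every step: neighbours are opposite.
alternating = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]

r_trend = autocorrelation(trend, 1)        # positive autocorrelation
r_alt = autocorrelation(alternating, 1)    # negative autocorrelation
print(r_trend, r_alt)
```

A value near zero at every lag would be consistent with independent observations, while values far from zero indicate the kind of dependence described above.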


The assumption of homoscedasticity (literally, same variance) is
central to linear regression models. Homoscedasticity describes a
situation in which the variance of the error term (that is, the noise
or random disturbance in the relationship between the independent
variables and the dependent variable) is the same across all values
of the independent variables. Heteroscedasticity (the violation of
homoscedasticity) is present when the size of the error term differs
across values of an independent variable. The impact of violating
the assumption of homoscedasticity is a matter of degree, increasing
as heteroscedasticity increases.
A simple bivariate example can help to illustrate heteroscedasticity.
Imagine we have data on family income and spending on luxury
items. Using bivariate regression, we use family income to predict
luxury spending (as expected, there is a strong, positive association
between income and spending). Upon examining the residuals we
detect a problem: the residuals are very small for low values of
family income (families with low incomes don't spend much on
luxury items) while there is great variation in the size of the
residuals for wealthier families (some families spend a great deal on
luxury items while some are more moderate in their luxury
spending). This situation represents heteroscedasticity because the
size of the error varies across values of the independent variable.
Examining the scatterplot of the residuals against the predicted
values of the dependent variable would show the classic cone-shaped
pattern of heteroscedasticity.
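The widening residuals in the income example can be reproduced with a short sketch. The data are made up: spending is set to 10% of income plus or minus 5% of income, so the disturbance grows with income by construction. The helper name fit_line is an assumption for illustration, not part of any library.

```python
def fit_line(xs, ys):
    """Bivariate OLS fit; returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Made-up incomes (in thousands) and luxury spending. The deviation from
# the 10% trend alternates in sign and is proportional to income.
incomes = [10 * (i + 1) for i in range(20)]
spending = [0.10 * x + (0.05 * x if i % 2 == 0 else -0.05 * x)
            for i, x in enumerate(incomes)]

intercept, slope = fit_line(incomes, spending)
residuals = [y - (intercept + slope * x) for x, y in zip(incomes, spending)]

# Compare the typical residual size for the lower and upper income halves.
half = len(incomes) // 2
spread_low = sum(abs(r) for r in residuals[:half]) / half
spread_high = sum(abs(r) for r in residuals[half:]) / half
print(f"mean |residual|, lower incomes:  {spread_low:.2f}")
print(f"mean |residual|, higher incomes: {spread_high:.2f}")
```

Plotting the residuals against the fitted values for data like these would give the cone shape: a narrow band at the left that fans out to the right.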
The problem that heteroscedasticity presents for regression models
is simple. Recall that ordinary least-squares (OLS) regression seeks
to minimize residuals and in turn produce the smallest possible
standard errors. By definition OLS regression gives equal weight to
all observations, but when heteroscedasticity is present the cases
with larger disturbances have more pull than other observations.
The coefficients from OLS regression where heteroscedasticity is
present are therefore inefficient but remain unbiased. In this case,
weighted least squares regression would be more appropriate, as it
downweights those observations with larger disturbances.
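As a rough sketch of how downweighting works, the bivariate weighted least squares fit below replaces every sum in the OLS formulas with a weighted sum. The data are made up, and the weights 1 / x^2 rest on the assumption that the error variance is proportional to x^2. In practice the right weights depend on how the variance actually behaves.

```python
def weighted_fit(xs, ys, weights):
    """Bivariate weighted least squares; returns (intercept, slope)."""
    w_sum = sum(weights)
    # Weighted means replace the ordinary means used by OLS.
    x_bar = sum(w * x for w, x in zip(weights, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(weights, ys)) / w_sum
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar)
              for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Made-up data in which the scatter grows with x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.5, 11.0]

# Equal weights reproduce ordinary least squares.
ols = weighted_fit(xs, ys, [1.0] * len(xs))

# Assumed variance model: error variance proportional to x squared,
# so cases with large x (large disturbances) get small weights.
weights = [1.0 / x ** 2 for x in xs]
wls = weighted_fit(xs, ys, weights)

print("OLS (intercept, slope):", ols)
print("WLS (intercept, slope):", wls)
```

With equal weights the function gives the ordinary OLS line, which makes it easy to see how much the chosen weights move the estimates.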
A more serious problem associated with heteroscedasticity is the
fact that the standard errors are biased. Because the standard error
is central to conducting significance tests and calculating confidence
intervals, biased standard errors lead to incorrect conclusions about
the significance of the regression coefficients. Many statistical
programs provide an option for robust standard errors to correct this
bias; weighted least squares regression also addresses this concern
but requires a number of additional assumptions. Another approach
for dealing with heteroscedasticity is to transform the dependent
variable using one of the variance-stabilizing transformations. A
logarithmic transformation can be applied to highly skewed
variables, while count variables can be transformed using a square
root transformation. Overall, the violation of the homoscedasticity
assumption must be quite severe in order to present a major
problem given the robust nature of OLS regression.
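To show what a robust standard error changes, the sketch below computes the slope's standard error two ways for a bivariate fit: the classical formula, which assumes one common error variance, and White's heteroscedasticity-consistent estimator (often labelled HC0). Choosing HC0 is an assumption made for illustration, since statistical programs offer several variants. The function name and data are made up.

```python
import math

def slope_standard_errors(xs, ys):
    """Return (slope, classical_se, robust_se) for a bivariate OLS fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

    # Classical formula: a single pooled error variance for all cases.
    sigma2 = sum(e ** 2 for e in residuals) / (n - 2)
    classical_se = math.sqrt(sigma2 / sxx)

    # White (HC0) formula: each case contributes its own squared residual,
    # weighted by how far its x value lies from the mean.
    meat = sum((x - x_bar) ** 2 * e ** 2 for x, e in zip(xs, residuals))
    robust_se = math.sqrt(meat / sxx ** 2)
    return slope, classical_se, robust_se

# Made-up data whose scatter grows with x.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 2.1, 2.8, 4.5, 4.0, 7.5]
slope, classical_se, robust_se = slope_standard_errors(xs, ys)
print(slope, classical_se, robust_se)
```

The slope estimate is identical in both cases. Only the standard error, and therefore the significance test and confidence interval, differs.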


An important assumption of the classical linear regression model is that the
error term has constant variance (that is, it is homoscedastic). Whenever that
assumption is violated, heteroscedasticity is said to be present in the data.
An example can help explain heteroscedasticity.
Consider an income-saving model in which the income of a person is
regarded as the independent variable, and the savings made by that
individual are regarded as the dependent variable. Ordinarily, as the income
of the individual increases, the savings also increase. But in the presence of
heteroscedasticity, the graph would depict something unusual: for example,
as income increases, some individuals save considerably more while the
savings of others remain unchanged, so the spread of savings widens with
income.
This example also illustrates the major difference between heteroscedasticity
and homoscedasticity. Heteroscedasticity is often due to the presence of
outliers in the data, that is, observations that are either very small or very
large relative to the other observations in the sample.
Heteroscedasticity can also be caused by the omission of variables from the
model. Considering the same income-saving model, if the income variable is
deleted from the model, then the researcher would not be able to interpret
anything from the model.
Heteroscedasticity is more common in cross-sectional data than in time
series data. If ordinary least squares (OLS) is applied when
heteroscedasticity is present, it becomes difficult for the researcher to
construct valid confidence intervals and tests of hypotheses, because the
usual variance formulas no longer hold. In the presence of
heteroscedasticity, the variance of the OLS estimator is no longer the
minimum: it is at least as large as the variance of the best linear unbiased
estimator (BLUE). Therefore, the results obtained by the researcher through
significance tests would be inaccurate because of the presence of
heteroscedasticity.

Multicollinearity is a state of very high intercorrelations or inter-associations
among the independent variables. It is therefore a type of disturbance in the
data, and if present in the data the statistical inferences made about the data
may not be reliable.
There are certain reasons why multicollinearity occurs:
It is caused by an inaccurate use of dummy variables.
It is caused by the inclusion of a variable which is computed from other variables in the data set.
It can also result from the repetition of the same kind of variable.
It generally occurs when the variables are highly correlated with each other.

Multicollinearity can result in several problems. These problems are as follows:
The partial regression coefficients may not be estimated precisely. The standard errors are likely to be high.
Multicollinearity results in a change in the signs as well as in the magnitudes of the partial regression coefficients from one sample to another.
Multicollinearity makes it difficult to assess the relative importance of the independent variables in explaining the variation in the dependent variable.
In the presence of high multicollinearity, the confidence intervals of the
coefficients tend to become very wide and the t-statistics tend to be very small. It
becomes difficult to reject the null hypothesis of any study when
multicollinearity is present in the data under study.
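One common way to quantify how strongly predictors are intercorrelated is the variance inflation factor (VIF), which is not named in the text above and is introduced here only as an illustration. For a model with exactly two predictors, the VIF of either one reduces to 1 / (1 - r^2), where r is the correlation between them. The sketch below covers only that two-predictor case, and the data are made up.

```python
def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model."""
    n = len(x1)
    m1 = sum(x1) / n
    m2 = sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    # Squared Pearson correlation between the two predictors.
    r_squared = s12 ** 2 / (s11 * s22)
    if r_squared >= 1.0:
        # Perfect collinearity: the coefficients cannot be separated.
        return float("inf")
    return 1.0 / (1.0 - r_squared)

# Made-up predictors with correlation 0.8.
vif_moderate = vif_two_predictors([1, 2, 3, 4], [1, 3, 2, 4])
# Made-up uncorrelated predictors.
vif_none = vif_two_predictors([1, 2, 3, 4], [1, -1, -1, 1])
# One predictor computed directly from the other (x2 = 2 * x1).
vif_perfect = vif_two_predictors([1, 2, 3, 4], [2, 4, 6, 8])
print(vif_moderate, vif_none, vif_perfect)
```

A VIF of 1 means the coefficient's variance is not inflated at all. The larger the value, the wider the confidence intervals become, which matches the problems listed above.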

DEFINITION of 'Goodness-Of-Fit'
Used in statistics and statistical modelling to compare an anticipated frequency to
an actual frequency. Goodness-of-fit tests are often used in business decision
making. In order to calculate a chi-square goodness-of-fit statistic, it is necessary
to first state the null hypothesis and the alternative hypothesis, choose a significance
level (such as α = 0.05) and determine the critical value.
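The steps can be sketched for a small made-up case: 120 customers choose among four products, and the null hypothesis is that all four are equally popular. The significance level of 0.05 is an assumption for this sketch, and the critical value 7.815 is the standard chi-square table entry for 3 degrees of freedom at that level.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Made-up counts: actual frequencies versus the frequencies anticipated
# under the null hypothesis of equal preference (120 / 4 = 30 each).
observed = [40, 30, 30, 20]
expected = [30, 30, 30, 30]

stat = chi_square_statistic(observed, expected)

# Degrees of freedom = number of categories - 1 = 3.
# Table value for alpha = 0.05 and 3 degrees of freedom.
CRITICAL_VALUE = 7.815
reject_null = stat > CRITICAL_VALUE

print(f"chi-square statistic: {stat:.3f}")
print("reject null hypothesis:", reject_null)
```

Here the statistic falls below the critical value, so at this level the made-up counts do not give enough evidence to reject equal preference.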


Sum of Squares
DEFINITION of 'Sum of Squares'
A statistical technique used in regression analysis. The sum of squares is a
mathematical approach to determining the dispersion of data points. In a
regression analysis, the goal is to determine how well a data series can be fitted
to a function which might help to explain how the data series was generated. The
sum of squares is used as a mathematical way to find the function which best fits
(varies least from) the data.

In order to determine the sum of squares, the distance between each data point
and the line of best fit is squared and then all of the squares are summed up. The
line of best fit will minimize this value.
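A small sketch of this calculation follows, using four made-up data points. The function squares the vertical distance between each point and a candidate line and adds the squares up. For these particular points the least squares line works out to y = 1.9x, and the two other lines are arbitrary candidates chosen for comparison.

```python
def sum_of_squares(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and a line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Made-up data points.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

# The least squares line for these points (intercept 0, slope 1.9).
ss_best = sum_of_squares(xs, ys, 0.0, 1.9)

# Two other candidate lines, chosen arbitrarily for comparison.
ss_steeper = sum_of_squares(xs, ys, 0.0, 2.0)
ss_shifted = sum_of_squares(xs, ys, 1.0, 1.5)

print(f"least squares line: {ss_best:.2f}")
print(f"y = 2x:             {ss_steeper:.2f}")
print(f"y = 1 + 1.5x:       {ss_shifted:.2f}")
```

Any line other than the least squares line gives a larger sum for the same data, which is what "minimize this value" means.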

BREAKING DOWN 'Sum Of Squares'

There are two methods of regression analysis which use the sum of squares: the
linear least squares method and the non-linear least squares method. Least
squares refers to the fact that the regression function minimizes the sum of the
squares of the deviations from the actual data points. In this way, it is possible to
draw a function which statistically provides the best fit for the data. A regression
function can either be linear (a straight line) or non-linear (a curving line).


Least Squares
DEFINITION of 'Least Squares'
A statistical method used to determine a line of best fit by minimizing the sum of
squares created by a mathematical function. A "square" is determined by
squaring the distance between a data point and the regression line. The least
squares approach minimizes the distance between a function and the data points that
the function is trying to explain. It is used in regression analysis, often in nonlinear
regression modeling in which a curve is fit to a set of data.

BREAKING DOWN 'Least Squares'

The least squares approach is a popular method for determining regression
equations. Instead of trying to solve an equation exactly, mathematicians use
least squares to make a close approximation (which coincides with the
maximum-likelihood estimate when the errors are normally distributed). Modeling
methods that are often used when fitting a function to data include the straight
line method, polynomial method, logarithmic method and Gaussian method.
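For the straight line method, the least squares solution can be written in closed form. The derivation below is a standard one, using notation introduced here for illustration: data points (x_i, y_i), intercept a, slope b, and sample means written with bars.

```latex
% Sum of squares to be minimized over the intercept a and slope b
S(a, b) = \sum_{i=1}^{n} \left( y_i - a - b x_i \right)^2

% Setting both partial derivatives to zero gives the normal equations
\frac{\partial S}{\partial a} = -2 \sum_{i=1}^{n} \left( y_i - a - b x_i \right) = 0
\qquad
\frac{\partial S}{\partial b} = -2 \sum_{i=1}^{n} x_i \left( y_i - a - b x_i \right) = 0

% Solving the two equations for b and a
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
a = \bar{y} - b \bar{x}
```

The polynomial, logarithmic and Gaussian methods follow the same principle with a different function in place of a + b x_i. The Gaussian case is non-linear in its parameters and is usually solved iteratively.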
