
Multicollinearity

Definition and Nature of Multicollinearity:


The term multicollinearity means the existence of a perfect, or exact, linear relationship among some or all explanatory variables of a regression model. For the $k$-variable regression involving the explanatory variables $X_1, X_2, \ldots, X_k$ (where $X_1 = 1$ for all observations to allow for the intercept term), an exact linear relationship is said to exist if the following condition is satisfied:

$$\lambda_1 X_1 + \lambda_2 X_2 + \cdots + \lambda_k X_k = 0 \qquad (1)$$

where $\lambda_1, \lambda_2, \ldots, \lambda_k$ are constants such that not all of them are zero simultaneously.

Today, however, the term multicollinearity is also used for the case where the $X$ variables are intercorrelated but not perfectly so, as follows:

$$\lambda_1 X_1 + \lambda_2 X_2 + \cdots + \lambda_k X_k + v_i = 0 \qquad (2)$$

where $v_i$ is a stochastic error term.

Multicollinearity may also be used to describe nonlinear relationships among the $X$ variables. Consider the following regression model:

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \beta_3 X_i^3 + u_i \qquad (3)$$

where, say, $Y$ is the total cost of production and $X$ is the output. The variables $X_i^2$ and $X_i^3$ are obviously functionally related to $X_i$, but the relationship is nonlinear.

Note:
The difference between perfect and less-than-perfect multicollinearity: Let $\lambda_2 \neq 0$; then (1) can be written as

$$X_{2i} = -\frac{\lambda_1}{\lambda_2} X_{1i} - \frac{\lambda_3}{\lambda_2} X_{3i} - \cdots - \frac{\lambda_k}{\lambda_2} X_{ki} \qquad (4)$$

which shows how $X_2$ is exactly linearly related to the other variables, or how it can be derived from a linear combination of the other $X$ variables. In this situation, the coefficient of correlation between the variable $X_2$ and the linear combination on the right side of (4) is bound to be unity. Similarly, if $\lambda_2 \neq 0$, then (2) can be written as

$$X_{2i} = -\frac{\lambda_1}{\lambda_2} X_{1i} - \frac{\lambda_3}{\lambda_2} X_{3i} - \cdots - \frac{\lambda_k}{\lambda_2} X_{ki} - \frac{1}{\lambda_2} v_i \qquad (5)$$

which shows that $X_2$ is not an exact linear combination of the other $X$'s because it is also determined by the stochastic error term $v_i$.

Why does the Classical Linear Regression Model Assume that There is No Collinearity Among the X's?
The reason is that if multicollinearity is perfect, the regression coefficients of the $X$ variables are indeterminate and their standard errors are infinite. If multicollinearity is less than perfect, the regression coefficients, although determinate, possess large standard errors (in relation to the coefficients themselves), which means the coefficients cannot be estimated with great precision or accuracy.

Sources of Multicollinearity:
Multicollinearity may be due to the following factors.
a) There is a tendency of economic variables to move together over time.
b) The use of lagged values of some explanatory variables as separate independent variables in the relationship.
c) The data collection method employed, for example, sampling over a limited range of the values taken by the regressors in the population.
d) Constraints on the model or in the population being sampled. For example, in the regression of electricity consumption on income $(X_2)$ and house size $(X_3)$, there is a physical constraint in the population in that families with higher incomes generally have larger homes than families with lower incomes.
e) Model specification, for example, adding polynomial terms to a regression model, especially when the range of the $X$ variable is small.
f) An overdetermined model. This happens when the model has more explanatory variables than the number of observations. This could happen in medical research, where there may be a small number of patients about whom information is collected on a large number of variables.

Theorem: If the intercorrelation between the explanatory variables is perfect $(r_{x_i x_j} = 1)$, then
a) the estimates of the coefficients are indeterminate, and
b) the standard errors of these estimates become infinitely large.

Proof (a):
Suppose that the relation to be estimated is $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + u$ and that $X_1$ and $X_2$ are related by the exact relation $X_2 = kX_1$, where $k$ is any arbitrary constant. The formulae for the estimation of the coefficients $\beta_1$ and $\beta_2$ are (in deviation form)

$$\hat\beta_1 = \frac{(\sum x_1 y)(\sum x_2^2) - (\sum x_2 y)(\sum x_1 x_2)}{(\sum x_1^2)(\sum x_2^2) - (\sum x_1 x_2)^2}, \qquad
\hat\beta_2 = \frac{(\sum x_2 y)(\sum x_1^2) - (\sum x_1 y)(\sum x_1 x_2)}{(\sum x_1^2)(\sum x_2^2) - (\sum x_1 x_2)^2}$$

Substituting $kx_1$ for $x_2$, we obtain

$$\hat\beta_1 = \frac{k^2(\sum x_1 y)(\sum x_1^2) - k^2(\sum x_1 y)(\sum x_1^2)}{k^2(\sum x_1^2)^2 - k^2(\sum x_1^2)^2} = \frac{0}{0}$$

$$\hat\beta_2 = \frac{k(\sum x_1 y)(\sum x_1^2) - k(\sum x_1 y)(\sum x_1^2)}{k^2(\sum x_1^2)^2 - k^2(\sum x_1^2)^2} = \frac{0}{0}$$

Therefore the parameters are indeterminate, i.e., there is no way of finding separate values of each coefficient.

Proof (b):
If $r_{x_i x_j} = 1$, the standard errors of the estimates become infinitely large. In the model

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + u,$$

if $X_1$ and $X_2$ are perfectly correlated ($X_2 = kX_1$, where $k$ is any arbitrary constant), the variances of $\hat\beta_1$ and $\hat\beta_2$ are

$$V(\hat\beta_1) = \frac{\sigma_u^2 \sum x_2^2}{\sum x_1^2 \sum x_2^2 - (\sum x_1 x_2)^2}
\qquad \text{and} \qquad
V(\hat\beta_2) = \frac{\sigma_u^2 \sum x_1^2}{\sum x_1^2 \sum x_2^2 - (\sum x_1 x_2)^2}$$

Substituting $kx_1$ for $x_2$, we obtain

$$V(\hat\beta_1) = \frac{\sigma_u^2\, k^2 \sum x_1^2}{k^2(\sum x_1^2)^2 - k^2(\sum x_1^2)^2} = \frac{\sigma_u^2\, k^2 \sum x_1^2}{0}
\qquad \text{and} \qquad
V(\hat\beta_2) = \frac{\sigma_u^2 \sum x_1^2}{k^2(\sum x_1^2)^2 - k^2(\sum x_1^2)^2} = \frac{\sigma_u^2 \sum x_1^2}{0}$$

Thus the variances of the estimates become infinite.
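The two results above can be illustrated numerically. The short Python sketch below (all data and the value of $k$ are invented for illustration; it is not part of the original proof) shows that with $x_2 = kx_1$ the denominator of the OLS formulae is zero, and that the estimated variance from the formula in Proof (b) grows without bound as the correlation between the regressors approaches 1.

```python
# A minimal numerical sketch of the two proofs above (hypothetical data).
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 2.0
x1 = rng.normal(size=n)
x1 = x1 - x1.mean()                      # deviation form

# Perfect collinearity: x2 = k * x1, so X'X is singular
x2 = k * x1
XtX = np.array([[np.sum(x1**2), np.sum(x1 * x2)],
                [np.sum(x1 * x2), np.sum(x2**2)]])
print("determinant of X'X:", np.linalg.det(XtX))   # ~0: no unique OLS solution

# Near-perfect collinearity: Var(beta1_hat) grows as the added noise shrinks
sigma_u2 = 1.0                                     # assumed error variance
for noise_sd in (1.0, 0.1, 0.01):
    x2n = k * x1 + rng.normal(scale=noise_sd, size=n)
    x2n = x2n - x2n.mean()
    denom = np.sum(x1**2) * np.sum(x2n**2) - np.sum(x1 * x2n) ** 2
    var_b1 = sigma_u2 * np.sum(x2n**2) / denom     # formula from Proof (b)
    print(f"noise sd {noise_sd:5.2f}  r12 = {np.corrcoef(x1, x2n)[0, 1]:.4f}  "
          f"Var(b1) = {var_b1:.2f}")
```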

Why are the Estimates of the Coefficients Indeterminate?

The coefficient $\beta_2$ gives the rate of change in the average value of $Y$ as $X_2$ changes by a unit, holding $X_3$ constant. But if $X_3$ and $X_2$ are perfectly collinear, there is no way $X_3$ can be kept constant. It means that there is no way to isolate the separate influences of $X_2$ and $X_3$ from the given sample.

Estimation in the Presence of Perfect Multicollinearity:

We can write the three-variable regression model in deviation form as

$$y_i = \hat\beta_2 x_{2i} + \hat\beta_3 x_{3i} + \hat u_i \qquad (1)$$

Since $X_2$ and $X_3$ are perfectly correlated, we may assume that they are related as $x_{3i} = k x_{2i}$. Substituting $x_{3i} = k x_{2i}$ into (1), we get

$$y_i = \hat\beta_2 x_{2i} + \hat\beta_3 (k x_{2i}) + \hat u_i = (\hat\beta_2 + k\hat\beta_3)\, x_{2i} + \hat u_i = \hat\alpha\, x_{2i} + \hat u_i \qquad (2)$$

where $\hat\alpha = \hat\beta_2 + k\hat\beta_3$. Applying the usual OLS formula to (2), we get

$$\hat\alpha = \hat\beta_2 + k\hat\beta_3 = \frac{\sum x_{2i} y_i}{\sum x_{2i}^2} \qquad (3)$$

Therefore, although we can estimate $\alpha$ uniquely, there is no way to estimate $\beta_2$ and $\beta_3$ uniquely; mathematically,

$$\hat\alpha = \hat\beta_2 + k\hat\beta_3 \qquad (4)$$

gives us only one equation in two unknowns, and there is an infinity of solutions to (4) for given values of $\hat\alpha$ and $k$. So in the case of perfect multicollinearity one cannot get a unique solution for the individual regression coefficients. But one can get a unique solution for linear combinations of these coefficients: the linear combination $\beta_2 + k\beta_3$ is uniquely estimated by $\hat\alpha$, given the value of $k$.
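A small simulation sketch of this point (the data, the value $k = 2$ and the coefficient values below are hypothetical): regressing $y$ on $x_2$ alone recovers only $\hat\alpha = \hat\beta_2 + k\hat\beta_3$, and the design matrix containing both regressors is rank-deficient, so the individual coefficients cannot be separated.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
beta2, beta3, k = 1.5, 0.8, 2.0           # assumed "true" values for the illustration

x2 = rng.normal(size=n)
x3 = k * x2                               # perfect collinearity
y = beta2 * x2 + beta3 * x3 + rng.normal(scale=0.5, size=n)

# Deviation form: OLS of y on x2 alone gives alpha_hat = beta2 + k*beta3
x2d, yd = x2 - x2.mean(), y - y.mean()
alpha_hat = np.sum(x2d * yd) / np.sum(x2d**2)
print("alpha_hat:", alpha_hat, " true beta2 + k*beta3:", beta2 + k * beta3)

# Including both regressors leaves the system rank-deficient:
X = np.column_stack([x2, x3])
print("rank of [x2, x3]:", np.linalg.matrix_rank(X))   # 1, not 2
```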

Estimation in the Presence of High But Imperfect Multicollinearity:

We can write the three-variable regression model in deviation form as

$$y_i = \hat\beta_2 x_{2i} + \hat\beta_3 x_{3i} + \hat u_i \qquad (1)$$

Since $X_2$ and $X_3$ are not perfectly correlated, we may assume that they are related as

$$x_{3i} = k x_{2i} + v_i \qquad (2)$$

where $k \neq 0$ and $v_i$ is a stochastic error term such that $\sum x_{2i} v_i = 0$.

In this case, estimation of the regression coefficients $\beta_2$ and $\beta_3$ may be possible. Substituting (2) into the OLS formula

$$\hat\beta_2 = \frac{(\sum x_{2i} y_i)(\sum x_{3i}^2) - (\sum x_{3i} y_i)(\sum x_{2i} x_{3i})}{(\sum x_{2i}^2)(\sum x_{3i}^2) - (\sum x_{2i} x_{3i})^2}$$

we get

$$\hat\beta_2 = \frac{(\sum x_{2i} y_i)\left(k^2 \sum x_{2i}^2 + \sum v_i^2\right) - \left(k \sum x_{2i} y_i + \sum y_i v_i\right)\left(k \sum x_{2i}^2\right)}{(\sum x_{2i}^2)\left(k^2 \sum x_{2i}^2 + \sum v_i^2\right) - \left(k \sum x_{2i}^2\right)^2} \qquad (3)$$

where use is made of $\sum x_{2i} v_i = 0$. A similar expression can be derived for $\hat\beta_3$.

Now, unlike the case of perfect multicollinearity, there is no reason that (3) cannot be estimated. Of course, if $v_i$ is sufficiently small, say, very close to zero, (2) will indicate almost perfect collinearity and we shall be back to the indeterminate case of perfect collinearity.

Practical Consequences of Multicollinearity:

In cases of near or high multicollinearity, one is likely to encounter the following consequences:
1. Although BLUE, the OLS estimators have large variances and covariances.
2. Because of consequence 1, the confidence intervals tend to be much wider, leading to the acceptance of the zero null hypothesis (i.e., that the true population coefficient is zero) more readily.
3. Also because of consequence 1, the $t$ ratio of one or more coefficients tends to be statistically insignificant.
4. Although the $t$ ratio of one or more coefficients is statistically insignificant, $R^2$, the overall measure of goodness of fit, can be very high.
5. The OLS estimators and their standard errors can be sensitive to small changes in the data.

Large Variances and Covariances of OLS Estimators:

For the model

$$y_i = \hat\beta_2 x_{2i} + \hat\beta_3 x_{3i} + \hat u_i \qquad (1)$$

the variances and covariance of $\hat\beta_2$ and $\hat\beta_3$ are given by

$$V(\hat\beta_2) = \frac{\sigma^2}{\sum x_{2i}^2 \left(1 - r_{23}^2\right)} \qquad (2)$$

$$V(\hat\beta_3) = \frac{\sigma^2}{\sum x_{3i}^2 \left(1 - r_{23}^2\right)} \qquad (3)$$

$$\operatorname{cov}(\hat\beta_2, \hat\beta_3) = \frac{-r_{23}\,\sigma^2}{\left(1 - r_{23}^2\right)\sqrt{\sum x_{2i}^2}\sqrt{\sum x_{3i}^2}} \qquad (4)$$

where $r_{23}$ is the coefficient of correlation between $X_2$ and $X_3$. It is apparent from (2) and (3) that as $r_{23}$ tends toward 1, that is, as collinearity increases, the variances of the two estimators increase, and in the limit when $r_{23} = 1$ they are infinite. It is equally clear from (4) that as $r_{23}$ increases toward 1, the covariance of the two estimators also increases in absolute value.

The speed with which the variances and covariance increase can be seen with the variance-inflating factor (VIF), which is defined as

$$\text{VIF} = \frac{1}{1 - r_{23}^2} \qquad (5)$$

VIF shows how the variance of an estimator is inflated by the presence of multicollinearity. As $r_{23}^2$ approaches 1, the VIF approaches infinity. That is, as the extent of collinearity increases, the variance of an estimator increases, and in the limit it can become infinite. So we can write

$$V(\hat\beta_2) = \frac{\sigma^2}{\sum x_{2i}^2}\,\text{VIF} \qquad \text{and} \qquad V(\hat\beta_3) = \frac{\sigma^2}{\sum x_{3i}^2}\,\text{VIF}$$

which show that the variances of $\hat\beta_2$ and $\hat\beta_3$ are directly proportional to the VIF.

For the $k$-variable model, the variance of the $j$th coefficient can be expressed as

$$V(\hat\beta_j) = \frac{\sigma^2}{\sum x_j^2}\left(\frac{1}{1 - R_j^2}\right) \qquad (6)$$

where $\hat\beta_j$ is the partial regression coefficient of regressor $X_j$ and $R_j^2$ is the $R^2$ in the regression of $X_j$ on the remaining $(k-2)$ regressors. We can also write (6) as

$$V(\hat\beta_j) = \frac{\sigma^2}{\sum x_j^2}\,\text{VIF}_j$$

The inverse of the VIF is called tolerance (TOL). That is,

$$\text{TOL}_j = \frac{1}{\text{VIF}_j} = 1 - R_j^2 \qquad (7)$$

When $R_j^2 = 1$ (i.e., perfect collinearity), $\text{TOL}_j = 0$, and when $R_j^2 = 0$ (i.e., no collinearity whatsoever), $\text{TOL}_j$ is 1. Because of the intimate connection between VIF and TOL, one can use them interchangeably.
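As a practical sketch, $\text{VIF}_j$ and $\text{TOL}_j$ can be computed directly from the auxiliary-regression $R_j^2$ described above; statsmodels provides a `variance_inflation_factor` helper that computes $1/(1 - R_j^2)$ from exactly that auxiliary regression. The data below are simulated and the variable names are assumptions.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
n = 100
x2 = rng.normal(size=n)
x3 = 0.95 * x2 + rng.normal(scale=0.3, size=n)     # highly (not perfectly) collinear
X = sm.add_constant(np.column_stack([x2, x3]))     # columns: const, x2, x3

for j, name in [(1, "x2"), (2, "x3")]:
    vif = variance_inflation_factor(X, j)          # 1 / (1 - R_j^2)
    print(f"{name}: VIF = {vif:6.2f}   TOL = {1.0 / vif:.3f}")
```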

Wider Confidence Intervals:


Because of large standard errors, the confidence intervals for the relevant population parameters tend to be larger. Therefore, in cases of high multicollinearity, the sample data may be compatible with a diverse set of hypotheses. Hence, the probability of accepting a false hypothesis (i.e., type II error) increases.

Insignificant t Ratio:
To test the null hypothesis that, say, $\beta_2 = 0$, we use the $t$ ratio, that is, $\hat\beta_2 / \text{se}(\hat\beta_2)$, and compare the estimated $t$ value with the critical $t$ value from the $t$ table. But in cases of high collinearity the estimated standard errors increase dramatically, thereby making the $t$ values smaller. Therefore, in such cases, one will increasingly accept the null hypothesis that the relevant true population value is zero.

A High R² But Few Significant t Ratios:

In cases of high collinearity, it is possible that one or more of the partial slope coefficients are individually statistically insignificant on the basis of the $t$ test, yet the $R^2$ in such situations may be very high. The real problem is the covariances between the estimators, which are related to the correlations between the regressors.

Sensitivity of OLS Estimators and Their Standard Errors to Small Changes in Data:
As long as multicollinearity is not perfect, estimation of the regression coefficients is possible but the estimates and their standard errors become very sensitive to even the slightest change in the data.

Detection of Multicollinearity:
There are different methods of detecting multicollinearity, such as:

a) High $R^2$ but few significant $t$ ratios
b) High pair-wise correlations among regressors
c) Examination of partial correlations
d) Auxiliary regressions
e) Eigenvalues and condition index
f) Tolerance and variance inflation factor

a) High R² But Few Significant t Ratios:

This is the classic symptom of multicollinearity. If $R^2$ is high, say, in excess of 0.8, the $F$ test in most cases will reject the hypothesis that the partial slope coefficients are simultaneously equal to zero, but the individual $t$ tests will show that none or very few of the partial slope coefficients are statistically different from zero.
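A hedged simulation of this symptom (all numbers below are made up for illustration): with two nearly collinear regressors, the overall $F$ test typically rejects the joint null decisively even when both individual $t$ tests fail to reject zero.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 40
x2 = rng.normal(size=n)
x3 = x2 + rng.normal(scale=0.05, size=n)           # r23 very close to 1
y = 1.0 + 1.0 * x2 + 1.0 * x3 + rng.normal(size=n)

res = sm.OLS(y, sm.add_constant(np.column_stack([x2, x3]))).fit()
print("R-squared:", round(res.rsquared, 3))        # typically high
print("F-test p-value:", res.f_pvalue)             # typically tiny
print("t-test p-values:", res.pvalues[1:])         # often both well above 0.05
```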

b) High Pair-Wise Correlations Among Regressors:

If the pair-wise or zero-order correlation coefficient between two regressors is high, say, in excess of 0.8, then multicollinearity is a serious problem. However, high zero-order correlations are a sufficient but not a necessary condition for the existence of multicollinearity, because it can exist even though the zero-order or simple correlations are comparatively low. To see this, suppose we have a four-variable model:

$$Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i} + u_i$$

and suppose that $X_{4i} = \lambda_2 X_{2i} + \lambda_3 X_{3i}$, where $\lambda_2$ and $\lambda_3$ are constants, not both zero. Obviously, $X_4$ is an exact linear combination of $X_2$ and $X_3$, giving $R^2_{4.23} = 1$, the coefficient of determination in the regression of $X_4$ on $X_2$ and $X_3$.

We know that

$$R^2_{4.23} = \frac{r_{42}^2 + r_{43}^2 - 2 r_{42} r_{43} r_{23}}{1 - r_{23}^2} \qquad (1)$$

But since $R^2_{4.23} = 1$ because of perfect collinearity, we obtain

$$1 = \frac{r_{42}^2 + r_{43}^2 - 2 r_{42} r_{43} r_{23}}{1 - r_{23}^2} \qquad (2)$$

It can be seen that (2) is satisfied by $r_{42} = 0.5$, $r_{43} = 0.5$ and $r_{23} = -0.5$, which are not very high values: the numerator is $0.25 + 0.25 - 2(0.5)(0.5)(-0.5) = 0.75$ and the denominator is $1 - 0.25 = 0.75$.

Therefore, in models involving more than two explanatory variables, the simple or zero-order correlations will not provide an infallible guide to the presence of multicollinearity. Of course, if there are only two explanatory variables, the zero-order correlation will suffice.
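The numerical point can be checked with a short simulation chosen to mimic the $r_{42} = r_{43} = 0.5$, $r_{23} = -0.5$ case (the data-generating choices below are assumptions, not from the text): the pairwise correlations are only moderate, yet $X_4$ is explained perfectly by $X_2$ and $X_3$.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
# X2, X3 with correlation -0.5; X4 is an exact linear combination of them
x2, x3 = rng.multivariate_normal([0, 0], [[1, -0.5], [-0.5, 1]], size=n).T
x4 = x2 + x3

print(np.round(np.corrcoef([x2, x3, x4]), 2))      # r23 ~ -0.5, r42 ~ 0.5, r43 ~ 0.5

# R^2 of the auxiliary regression of X4 on X2 and X3 (with intercept)
X = np.column_stack([np.ones(n), x2, x3])
coef, *_ = np.linalg.lstsq(X, x4, rcond=None)
resid = x4 - X @ coef
r2 = 1 - resid.var() / x4.var()
print("R^2 of X4 on X2, X3:", round(r2, 6))        # essentially 1
```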

c) Examination of Partial Correlations:

Because of the problem just noted with zero-order correlations, one should also look at the partial correlation coefficients. Thus, in the regression of $Y$ on $X_2$, $X_3$ and $X_4$, a finding that $R^2_{1.234}$ is very high but $r^2_{12.34}$, $r^2_{13.24}$ and $r^2_{14.23}$ are comparatively low suggests that the variables $X_2$, $X_3$ and $X_4$ are highly intercorrelated and that at least one of these variables is unnecessary.

d) Auxiliary Regressions:
One way of finding out which $X$ variable is related to the other $X$ variables is to regress each $X_i$ on the remaining $X$ variables and compute the corresponding $R^2$. Each one of these regressions is called an auxiliary regression, auxiliary to the main regression of $Y$ on the $X$'s. Then

$$F_i = \frac{R^2_{x_i \cdot x_2 \cdots x_k} / (k - 2)}{\left(1 - R^2_{x_i \cdot x_2 \cdots x_k}\right) / (n - k + 1)} \sim F_{(k-2),\,(n-k+1)}$$

where $n$ stands for the sample size, $k$ stands for the number of explanatory variables including the intercept term, and $R^2_{x_i \cdot x_2 \cdots x_k}$ is the coefficient of determination in the regression of variable $X_i$ on the remaining $X$ variables. If the computed $F_i$ exceeds the critical $F$ at the chosen level of significance, it is taken to mean that the particular $X_i$ is collinear with the other $X$'s; if it does not exceed the critical $F$, we say that it is not collinear with the other $X$'s.
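A sketch of how the auxiliary regressions and the $F_i$ statistic above might be computed (simulated data; the variable names and numerical constants are assumptions):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(5)
n = 80
x2 = rng.normal(size=n)
x3 = 0.9 * x2 + rng.normal(scale=0.4, size=n)      # collinear with x2
x4 = rng.normal(size=n)                            # unrelated regressor
X = pd.DataFrame({"x2": x2, "x3": x3, "x4": x4})

k = X.shape[1] + 1                                 # explanatory variables incl. intercept
for col in X.columns:
    others = sm.add_constant(X.drop(columns=col))
    r2 = sm.OLS(X[col], others).fit().rsquared     # auxiliary R^2
    F = (r2 / (k - 2)) / ((1 - r2) / (n - k + 1))  # formula above
    p = 1 - stats.f.cdf(F, k - 2, n - k + 1)
    print(f"{col}: aux R^2 = {r2:.3f}  F = {F:6.2f}  p = {p:.4f}")
```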

e) Eigenvalues and Condition Index:

Given the eigenvalues, which can be obtained by using matrix algebra, we can calculate the condition number $k$, defined as

$$k = \frac{\text{Maximum eigenvalue}}{\text{Minimum eigenvalue}}$$

and the condition index (CI), defined as

$$\text{CI} = \sqrt{\frac{\text{Maximum eigenvalue}}{\text{Minimum eigenvalue}}} = \sqrt{k}$$

If $k$ is between 100 and 1000, there is moderate to strong multicollinearity, and if it exceeds 1000 there is severe multicollinearity. Alternatively, if the CI ($= \sqrt{k}$) is between 10 and 30, there is moderate to strong multicollinearity, and if it exceeds 30 there is severe multicollinearity.
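A minimal numpy sketch of the condition number and condition index (simulated data; using the eigenvalues of the column-scaled $X'X$ matrix is one common convention and is an assumption here, since scalings differ across authors):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
x2 = rng.normal(size=n)
x3 = 0.98 * x2 + rng.normal(scale=0.1, size=n)     # strongly collinear pair
X = np.column_stack([np.ones(n), x2, x3])

# Scale each column to unit length, then take eigenvalues of X'X
Xs = X / np.linalg.norm(X, axis=0)
eigvals = np.linalg.eigvalsh(Xs.T @ Xs)

k_number = eigvals.max() / eigvals.min()           # condition number k
ci = np.sqrt(k_number)                             # condition index CI = sqrt(k)
print("eigenvalues:", np.round(eigvals, 4))
print("condition number k:", round(k_number, 1), " CI:", round(ci, 2))
```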

f) Tolerance and Variance Inflation Factor:

Some authors use the VIF as an indicator of multicollinearity. If the VIF of a variable exceeds 10, which will happen if $R_j^2$ exceeds 0.90, that variable is said to be highly collinear. One could also use $\text{TOL}_j$ as a measure of multicollinearity. The closer $\text{TOL}_j$ is to zero, the greater the degree of collinearity of that variable with the other regressors. On the other hand, the closer $\text{TOL}_j$ is to 1, the greater the evidence that $X_j$ is not collinear with the other regressors.

Remedial Measures:
There are some rules for remedying multicollinearity:
a) A priori information
b) Combining cross-sectional and time series data
c) Dropping a variable(s) and specification bias
d) Transformation of variables
e) Additional or new data
f) Reducing collinearity in polynomial regressions
g) Other methods of remedying multicollinearity

a) A Priori Information:
Suppose we consider the model

$$Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + u_i$$

where $Y$ = consumption, $X_2$ = income, and $X_3$ = wealth. Suppose a priori we believe that $\beta_3 = 0.10\,\beta_2$. We can then run the following regression:

$$Y_i = \beta_1 + \beta_2 X_{2i} + 0.10\,\beta_2 X_{3i} + u_i = \beta_1 + \beta_2 X_i + u_i$$

where $X_i = X_{2i} + 0.10\,X_{3i}$. Once we obtain $\hat\beta_2$, we can estimate $\hat\beta_3$ from the postulated relationship between $\beta_2$ and $\beta_3$. We can get a priori information from previous empirical work in which the collinearity problem happens to be less serious, or from the relevant theory underlying the field of study.
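A sketch of the restricted estimation just described (simulated data; the coefficient values and the 0.10 ratio are illustrative assumptions):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 200
income = rng.normal(50, 10, size=n)                 # X2
wealth = 5 * income + rng.normal(scale=20, size=n)  # X3, highly collinear with income
beta1, beta2 = 10.0, 0.8
beta3 = 0.10 * beta2                                # the a priori restriction
y = beta1 + beta2 * income + beta3 * wealth + rng.normal(scale=2, size=n)

# Impose beta3 = 0.10*beta2 by building the composite regressor X = X2 + 0.10*X3
x_composite = income + 0.10 * wealth
res = sm.OLS(y, sm.add_constant(x_composite)).fit()
b2_hat = res.params[1]
b3_hat = 0.10 * b2_hat                              # recovered from the restriction
print("beta2_hat:", round(b2_hat, 3), " beta3_hat:", round(b3_hat, 4))
```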

b) Combining Cross-Sectional and Time Series Data:

The combination of cross-sectional and time-series data is known as pooling the data. Suppose we want to study the demand for automobiles in the United States and assume we have time series data on the number of cars sold, the average price of a car, and consumer income. Suppose also that

$$\ln Y_t = \beta_1 + \beta_2 \ln P_t + \beta_3 \ln I_t + u_t$$

where $Y$ = number of cars sold, $P$ = average price, $I$ = income, and $t$ = time. Our objective is to estimate the price elasticity $\beta_2$ and the income elasticity $\beta_3$. In time series data the price and income variables generally tend to be highly collinear, so we cannot run the preceding regression as it stands. A way out of this, if we have cross-sectional data, is to obtain a fairly reliable estimate of the income elasticity $\beta_3$ from such data, because at a point in time the prices do not vary much. Let the cross-sectionally estimated income elasticity be $\hat\beta_3$. Using this estimate, we may write

$$Y_t^{*} = \beta_1 + \beta_2 \ln P_t + u_t$$

where $Y_t^{*} = \ln Y_t - \hat\beta_3 \ln I_t$, that is, $Y^{*}$ represents the value of $Y$ after removing from it the effect of income. We can now obtain an estimate of the price elasticity $\beta_2$ from the preceding regression.
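A hedged sketch of the two-step idea (the data and the value used for the cross-sectional estimate $\hat\beta_3$ are invented for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
T = 60
ln_income = np.linspace(9.0, 10.0, T) + rng.normal(scale=0.02, size=T)
ln_price = 0.8 * ln_income + rng.normal(scale=0.03, size=T)   # collinear with income
ln_sales = 4.0 - 1.2 * ln_price + 1.5 * ln_income + rng.normal(scale=0.05, size=T)

beta3_cross_section = 1.5     # assumed income elasticity taken from cross-sectional data

# Step 2: purge the income effect, then estimate the price elasticity from time series
y_star = ln_sales - beta3_cross_section * ln_income
res = sm.OLS(y_star, sm.add_constant(ln_price)).fit()
print("estimated price elasticity:", round(res.params[1], 3))  # close to -1.2
```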

c) Dropping a Variable(s) and Specification Bias:

When faced with multicollinearity, one of the simplest things to do is to drop one of the collinear variables. But in dropping a variable from the model we may be committing a specification bias or specification error. Specification bias arises from incorrect specification of the model used in the analysis. So before dropping a variable we should bear in mind that multicollinearity may prevent precise estimation of the parameters of the model, but omitting a variable may seriously mislead us as to the true values of the parameters.

d) Transformation of Variables:
Suppose we have time series data on consumption expenditure, income and wealth, and income and wealth are highly correlated. One way of minimizing this dependence is as follows. If the relation

$$Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + u_t \qquad (1)$$

holds at time $t$, it must also hold at time $t-1$, because the origin of time is arbitrary anyway. Therefore, we have

$$Y_{t-1} = \beta_1 + \beta_2 X_{2,t-1} + \beta_3 X_{3,t-1} + u_{t-1} \qquad (2)$$

If we subtract (2) from (1), we obtain

$$Y_t - Y_{t-1} = \beta_2 (X_{2t} - X_{2,t-1}) + \beta_3 (X_{3t} - X_{3,t-1}) + v_t \qquad (3)$$

where $v_t = u_t - u_{t-1}$. Equation (3) is known as the first difference form.

The first difference regression model often reduces multicollinearity, because there is no a priori reason to believe that the differences of the variables will also be highly correlated. Another type of transformation is the ratio transformation.
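A short pandas sketch of the first-difference transformation (synthetic trending data; variable names and constants are assumptions). The differenced series typically show a much lower correlation than the levels:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
T = 200
# Two trending series whose levels are highly correlated
income = np.cumsum(rng.normal(1.0, 1.0, size=T))
wealth = 2 * income + np.cumsum(rng.normal(0.0, 5.0, size=T))
df = pd.DataFrame({"income": income, "wealth": wealth})

print("correlation of levels:     ", round(df["income"].corr(df["wealth"]), 3))
d = df.diff().dropna()                              # first differences
print("correlation of differences:", round(d["income"].corr(d["wealth"]), 3))
```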

e) Additional or New Data:

Since multicollinearity is a sample feature, it is possible that in another sample involving the same variables collinearity may not be as serious as in the first sample. Sometimes simply increasing the size of the sample may remove the collinearity problem.

f) Reducing Collinearity in Polynomial Regressions:

Polynomial regression models tend to create multicollinearity among the explanatory variables, since powers of the same variable are included as regressors. If the explanatory variable(s) are expressed in deviation form (i.e., as deviations from their mean), multicollinearity is substantially reduced. But even then the problem may persist, in which case one may want to consider techniques such as orthogonal polynomials.
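A quick numerical check of the deviation-form point (hypothetical data over a deliberately small range): centering $X$ before forming $X^2$ sharply reduces the correlation between the two regressors.

```python
import numpy as np

rng = np.random.default_rng(10)
x = rng.uniform(10, 12, size=500)          # small range, as mentioned in the text
print("corr(X, X^2):              ", round(np.corrcoef(x, x**2)[0, 1], 4))

xc = x - x.mean()                          # deviation (centered) form
print("corr(X - Xbar, (X - Xbar)^2):", round(np.corrcoef(xc, xc**2)[0, 1], 4))
```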

g) Other Methods of Remedying Multicollinearity:


Multivariate statistical techniques such as factor analysis and principal components or techniques such as ridge regression are often employed to solve the problem of multicollinearity.
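For completeness, a hedged sketch of one such technique, ridge regression, using scikit-learn (simulated data; the penalty alpha = 1.0 is arbitrary). The ridge penalty shrinks the unstable OLS coefficients of the collinear pair toward more stable values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(11)
n = 60
x2 = rng.normal(size=n)
x3 = x2 + rng.normal(scale=0.05, size=n)            # nearly collinear
y = 1.0 + 2.0 * x2 + 3.0 * x3 + rng.normal(size=n)
X = np.column_stack([x2, x3])

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)                   # penalized (shrunken) estimates
print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
```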
