
Regression: An Introduction to Econometrics
Overview
The goal of econometric work is to help us move from the qualitative analysis of the theoretical work
favored in textbooks to the quantitative world in which policy makers operate. The focus of this work is
the quantification of relationships. For example, in microeconomics one of the central concepts is
demand. It is one half of the supply-demand model that economists use to explain prices, whether it is the
price of a stock, the exchange rate, wages, or the price of bananas. One of the fundamental rules of
economics is the downward-sloping demand curve: an increase in price will result in a lower quantity demanded.

Knowing this, would you be in a position to decide on a pricing strategy for your product? For example,
armed with the knowledge that demand is negatively related to price, do you have enough information to
decide whether a price increase or decrease will raise sales revenue? You may recall from a discussion in
an intro econ course that the answer depends upon the elasticity of demand, a measure of how responsive
demand is to price changes.

But how do we get the elasticity of demand? In your earlier work you were simply given the number and
asked how it would influence your choices; here you will be asked to figure out what the elasticity
figure is. It is here that things become interesting, where we must move from the deterministic world of algebra
and calculus to the probabilistic world of statistics. To make this move, a working knowledge of
econometrics, a fancy name for applied statistics, is extremely valuable.

As you will see, this is not a place for the faint of heart. There are a number of valuable techniques you
will be exposed to in econometrics. You will work hard on setting up the 'right experiment' for your study,
collecting the data, and specifying the equation. Unfortunately, this is only the beginning. There will never be
a magic button that produces 'truth' at the end of some regression, the favorite econometric technique for
estimating relationships. You can also be assured you will not get it quite right the first time. There is,
however, something to be learned from your 'mistakes'. To the trained eye, the summary statistics produced
by any regression package paint a vivid, if somewhat blurred, picture of the problems with the model as
specified. These are problems that must be dealt with because they can produce biases in the results that
reduce the reliability of the regression and increase the chance we will not end up with an understanding of the
true relationship. With existing software packages anyone can produce regression results, so one needs to
be aware of the limitations of the analysis when evaluating regression results.

In this overview of econometrics we will begin with a discussion of Specification. What equation will we
estimate? Does demand depend upon price alone, or does income also matter? Is demand linearly or
nonlinearly related to price? These are the types of questions discussed in this section. We will then shift to
Interpretation, a discussion of how to interpret the results of our regression. What if we find out demand is
negatively related to price? Should we believe the result? And what about the times when demand turns out
to be positively related to price? How could we explain this result, and do we actually have proof that demand
curves should be positively sloped? This will be followed by a discussion of the assumptions of the
Classical Linear Model, all of the things that must go right if we are to have complete confidence in our
results. And for those instances where we have some reason to believe there is a problem, we have a
discussion of the Limitations of the Classical Linear Model, where the potential problems as well as
solutions are discussed.

When you have completed this section, you should be well aware of the fact that the estimation of 'economic
relationships' has both an art and a science component. Given the technology available to people today,
anyone can run regressions with the use of some magic buttons. Computer programs exist that allow us to
estimate the regressions, perform diagnostics to evaluate the model, and correct any problems encountered.
Do not, however, be misled into thinking your empirical work will be easy. As you will find with your own
work, there is a long road of painful, time-consuming work ahead of anyone who embarks on an empirical
project. Furthermore, there are many places where you can take a wrong turn. This section was designed to
offer you some guidance as you make the journey, to help you know in advance the obstacles you are likely
to encounter and the best way of dealing with them.

There is a second reason for spending the time studying regression analysis and conducting your own
empirical project. Scientific advances are no guarantee that we are more likely to uncover the 'truth' that
we are searching for. The world is in many respects the same as it was when Darrell Huff was prompted to write his
wonderful little book entitled How to Lie With Statistics. In the hands of an unscrupulous researcher,
modern econometric software increases the chances of finding the results they want. The
complexities of the statistical analysis simply make it harder to find the biases in the study. Your time spent
here will simply increase the chances of recognizing those biases.

For an on-line overview of regression analysis you might want to check out the DAU and Stockburger
sites.

You should also check out the worksheet Regression, the output from an Excel regression. The data on
the sheet simple - year, inflation rate, unemployment rate, and interest rate - appear in cells A3:D50. Once
the data set is complete, you select Data Analysis in the Tools menu. You will then select Regression,
which will bring up a dialogue box. At this point you highlight the data set for the input box. The Y
variable is the variable you want to explain, in this case the interest rate. The X variable is the
explainer, in this case the inflation rate. We are going to use regression to see the extent to which the
inflation rate explains interest rates. You then specify the top left cell of the space where you want the
output to appear. For an interpretation of the results, you should check out the Interpretation page. In these
results you find the coefficient of inflation to be .68 - every time the inflation rate rises by one percentage
point, interest rates rise by nearly .7 percentage points. The t-statistic is 7.22, which indicates you should believe in
this relationship, and the R2 tells you the model helps explain about one half of the variation in interest rates.

Mechanics
Once you have decided to estimate a relationship using regression analysis, you need to decide upon the
appropriate software package. There are some very useful packages designed primarily for
regression-type analysis that you may want to explore if you are doing high-powered regression work or
are using the software in other courses. Here, however, we will stick with Excel, which allows you to
run some simple regression analyses. The first step is creation of the data set, an example of which can be
found on the simple tab of the Regression spreadsheet example. On the simple tab we will be
looking at a bivariate regression - a regression with only one right-side variable. The estimated equation
will be of the form Y = a + bX + e, where Y is the variable being explained (dependent) and X is the
variable doing the explaining (independent).

To estimate the regression you simply select Data Analysis from the Tools menu and within this select
Regression. You will get a dialogue box into which you need to input the relevant data. In the simple
example we will be trying to identify the impact inflation has on interest rates. Because the causality runs
from inflation to interest rates, the interest rate will be the dependent variable and the inflation rate will be
the independent variable. You input the dependent variable in the Input Y Range: box by highlighting the
interest rate column (C3:C50). You then input the independent variable in the Input X Range: box by
highlighting the inflation rate column (B3:B50). Because I did not include the labels, you do not check off the
Labels box. I then tell it I would like the output to have its top left corner in cell F2. After checking off all
the options you get all of the information on the simple tab.
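
For readers who prefer a scripting environment to Excel, the same bivariate regression can be reproduced in a few lines of Python. The sketch below is purely illustrative: the inflation and interest rate figures are invented rather than taken from the worksheet, and the statsmodels package is assumed to be installed. It reports the same kinds of quantities (coefficients, t-statistics, R-squared) that Excel's Data Analysis tool produces.

import numpy as np
import statsmodels.api as sm

# Hypothetical annual data: inflation (X) and interest rates (Y), in percent.
inflation = np.array([3.1, 2.9, 5.8, 4.3, 6.2, 7.6, 9.1, 5.8, 4.8, 3.9])
interest  = np.array([4.0, 4.3, 6.4, 5.6, 6.9, 7.5, 8.7, 6.6, 6.1, 5.2])

X = sm.add_constant(inflation)        # adds the intercept term a
model = sm.OLS(interest, X).fit()     # estimates Y = a + bX + e by least squares

print(model.params)                   # intercept and slope, like Excel's Coefficients column
print(model.tvalues)                  # t-statistics for each coefficient
print(model.rsquared)                 # share of the variation in Y explained by X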

Below are the data that appear with the regression output. While all of this output gives you important
information about the relationship, at this time your attention should be directed to just a few of the features
that are highlighted in red. The first is the adjusted R Square. This tells the reader that, of all of the year-
to-year variation in the interest rate, about 52% can be explained by movements in the independent
variable (the inflation rate). The second thing to look for is the coefficients. In this example, the regression
analysis suggests the best equation for these data would be:

Interest rate = 2.44 + .68*Inflation rate

What we are most interested in is the coefficient of the Inflation rate, which in this example is .68. This
means that every time the inflation rate rises by one percentage point (say, from 4 to 5 percent), the interest rate
rises by .68 percentage points. The final piece of valuable information is the t-Stat, which tells us how
much to "believe" in the coefficients. You will notice a t-Stat is associated with each coefficient, so you can
test the "believability" of all of the coefficients. Fortunately, there is a convenient rule of thumb for the t-
Stats. If the absolute value of the t-Stat is greater than 2, then you believe the coefficient is not zero (a zero
coefficient would be the case if there were no relationship). In this example you will see the t-Stats for both the
intercept and the coefficient of the inflation rate are greater than two, so you can assume the interest rate is
affected by the inflation rate.

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.72901
R Square             0.531456
Adjusted R Square    0.52127
Standard Error       1.990115
Observations         48

ANOVA
              df    SS         MS         F         Significance F
Regression     1    206.6476   206.6476   52.1764   4.22E-09
Residual      46    182.1856   3.960557
Total         47    388.8332

              Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept         2.443228        0.481394  5.075321  6.82E-06   1.474233   3.412222     1.474233     3.412222
X Variable 1      0.680573        0.094219  7.223323  4.22E-09   0.490921   0.870226     0.490921     0.870226

Specification
Before becoming involved with the more sophisticated statistical questions regarding regression analysis, it
is useful to briefly discuss some of the preliminary issues one must deal with in any empirical project. First,
it is important to note the difference between causality and correlation. The statistical analyses that are used
to determine the nature of the causality never actually allow us to prove causality. It is impossible to
separate causality from correlation. All we can reasonably hope to do is find statistical correlations that
do not disprove our hypotheses.

Given this limitation, once the decision has been made to undertake an empirical project, the principal
investigator must make a number of important choices. A schematic outline of the process is presented
below. The project starts with the choice of the theoretical relationship one wants to study, the
hypotheses one wants to test. Before proceeding with the development of the specific model, it is
appropriate to review the scholarly literature. There is little to be gained by reinventing the
wheel, and in reviewing these articles you might find information that would help with the other stages
of your empirical analysis. A good place to start your search of the literature would be the Journal of
Economic Literature.

After the review of the literature has been completed and the outlines of the model settled on, there is the
need to identify the data necessary to estimate your model. What are you going to use as the dependent
variable? For example, consider an empirical project designed to identify the link between investment
spending and interest rates. There is a need to specify which interest rate we are concerned with.
Is it the rate on 3-month government securities or the rate on 30-year bonds that we expect to affect
investment decisions?

A decision must also be made concerning the choice of the independent variables. The choice of the
regressors is based on the underlying economic theory. The variables should be selected because there is a
reason to believe they are causally related to Y. If, for example, your goal was to estimate a demand
equation for a certain product, then based on your knowledge of microeconomic theory you would need to
identify at the very least the appropriate data to capture the influence of income, population, and the price
of related goods. In each of these instances, you will be making choices that will significantly affect the
findings of your study. Furthermore, for every variable selected, you should have an a priori expectation for
the estimated parameters. Based on our understanding of economic theory, for example, the coefficient of
price in the demand equation should be negative and the coefficient of income should be positive.

A good example of the importance of the proper specification of the independent variable would be the
treatment of demographic factors in a demand equation. The normal choice would often be the population,
but there may be instances where this is likely to be an inappropriate choice. Consider the demand for
motorcycles. Is it the growth in the population or is it the growth in the population of young people that
matters? To the extent the primary market for motorcycles is younger people, then the use of total
population as an independent variable could cause problems. This would be the case if there was a
divergence between the two growth rates, a phenomenon of the 1970s. Similarly, a model for housing
demand would most certainly include as an independent variable some measure of population. Is it the
number of people or is it the number of separate households that is the primary determinant of demand?
The choice you make will have a significant impact on the results since we find that in the 1980s the
growth rates of the two differed substantially.

The choice of dependent and independent variables involves a number of other crucial decisions. For time-
series analysis, care has to be taken to avoid the mixing of seasonally adjusted and unadjusted data. This is
not a problem when dealing with annual data, but it is a potential problem when dealing with quarterly or
monthly data. It is also often relevant to adjust data for population. For example, in a demand equation for a
specific product, it might be personal income per capita rather than personal income that is the appropriate
independent variable.

One also has choices with regard to the form of the variables. Let us assume we believe the unemployment
rate has an influence on demand. Is it best captured by the level of the unemployment rate, which would be
used as an indicator of ‘ability to pay’, or would it be better measured by the change in the unemployment
rate, which would capture the 'expectations' effect of a change in the direction of the economy? When
estimating a saving equation, should the dependent variable be aggregate savings (S), the average savings
rate (S/Y), or the year-to-year change in the savings rate (Δ(S/Y))? Most likely, the answer to these
questions will be, at least in part, determined by the empirical work.

One must also be very careful to adjust the data for the influences of inflation. I will always recall my
undergraduate students who reported that the 1970s were a period of high growth because GNP grew more
rapidly during this period than in the 1960s and 1980s. This is certainly not the case. The 1970s figures
were primarily a reflection of higher inflation rates, and any econometric model should account for these
substantial differences. Returning to the product demand example, the model should certainly be specified
in terms of real, or inflation-adjusted, income. Similarly, when we examine the relationship between
investment spending and interest rates, it is the real interest rate we would expect to use as an
independent variable.

There is also the problem of dealing with phenomena that cannot be easily or adequately quantified. In a
model of the inflation-unemployment trade-off, there is reason to believe there was a significant difference
between the 1960s and 1970s. Another situation would be in a model of wage determination where we
were attempting to identify the relationship between average earnings (W) and the number of years of
education (E). In the wage study there would be a need to capture the gender effect because of the sharply
different profiles for males and females. In fact, it is questions such as this that are at the center of many of
the discrimination cases that get to the courtroom. Similarly, in any study of retail toy sales based on
quarterly data, it would be important to take explicit account of the fact sales are typically higher in the
fourth quarter.

Each of these problems can be solved with the use of dummy variables. A dummy variable is a 0-1 variable
that can best be viewed as an on-off switch. The left hand diagram describes the situation where we would
want to add an intercept dummy, a variable that has a value of 0 for each year in the 1960s and a value of 1
in the 1970s. The estimated equation would be:

i = b0 + b1*u + b2*D

The diagram indicates a situation where the coefficient of D would be positive; the intercept is shifted
upwards in the 1970s. The equations for the two time periods would be:

i = b0 + b1*u                (1960s)

i = (b0 + b2) + b1*u         (1970s)

A somewhat different situation is depicted in the second diagram. Here it is not the intercept but the slope
that seems to vary. For this example, consider the situation where the gender variable (G) has a value of
0 for each observation of a woman's wage and a value of 1 for each man's wage. We could use this dummy
variable to test the hypothesis that the education-earnings profile for women is flatter than it is for men,
that the extra earnings men receive for an extra year of education are greater than the gains for women. The
equation would be:

Ei = b0 + b1*Educ + b2*G*Educ

In this case, evidence of a steeper slope for men would be found in a positive estimate of b2. The
slope of the women's profile would be b1 while the slope of the men's profile would be b1 + b2. The equations
for the two groups would be:

E = b0 + (b1 + b2)*Educ      (males)

E = b0 + b1*Educ             (females)

Finally, in the retail sales equation in which we attempt to identify the link between sales (S) and income
(Y), it would be appropriate to specify three dummy variables. The first dummy variable would have a
value of 1 in the first quarter and 0 otherwise, the second would have a value of 1 in the second quarter and
0 otherwise, and the third would have a value of 1 in the third quarter and 0 otherwise.

S = b0 + b1*D1 + b2*D2 + b3*D3 + b4*Y

In this case, evidence of seasonal patterns in retail sales would be found in the coefficients of the dummy
variables. The equations for the four quarters would be:

S = (b0 + b1) + b4*Y         (Q1)
S = (b0 + b2) + b4*Y         (Q2)
S = (b0 + b3) + b4*Y         (Q3)
S = b0 + b4*Y                (Q4)

If sales were highest in the fourth quarter, then the coefficients on all of the dummy variables would be
negative.
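
As a sketch of how the seasonal dummies might be built in practice, the Python fragment below constructs D1-D3 from a hypothetical quarterly sales and income series and estimates the equation above by OLS. All data and variable names are invented for illustration.

import numpy as np
import statsmodels.api as sm

# Hypothetical quarterly data: three years of retail sales (S) and income (Y).
sales   = np.array([90, 95, 100, 140, 96, 101, 106, 150, 102, 108, 113, 160], dtype=float)
income  = np.array([200, 202, 205, 207, 210, 212, 215, 218, 220, 223, 226, 230], dtype=float)
quarter = np.tile([1, 2, 3, 4], 3)                 # quarter of each observation

# 0-1 dummies for quarters 1-3; the fourth quarter is the omitted base case.
D1 = (quarter == 1).astype(float)
D2 = (quarter == 2).astype(float)
D3 = (quarter == 3).astype(float)

X = sm.add_constant(np.column_stack([D1, D2, D3, income]))
results = sm.OLS(sales, X).fit()
print(results.params)    # b0, b1, b2, b3, b4; negative dummy coefficients imply a Q4 peak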

Having decided on the appropriate variables, the next issue is the choice of time-series
or cross-section analysis. Returning to the interest rate problem, one possibility would involve a study of
investment spending and interest rates for the year 1991 for a sample of 35 countries. A second approach
could focus on the behavior of these two phenomena in the U.S. over the past 30 years. Each approach has its
strengths and weaknesses and its econometric peculiarities. I suspect, however, that the majority of the work
you are likely to do will be time-series analysis. When you work with time-series data you must decide on
both the time period and the frequency of the data (daily, weekly, monthly, quarterly, annually).

We now have the variables and we have the data. The final decision to be made is the choice of the
estimation procedure. There are many possibilities open to the researcher interested in quantifying a
specific relationship. At this time I intend only to discuss linear regression, equations that are linear in their
parameters. Furthermore, I do not intend the discussion of regression analysis that follows to be a
replacement for statistics and econometrics texts. The emphasis here will be on a brief overview of the
process one goes through in arriving at a finished product. We will begin at the beginning with the single-
equation, bivariate linear regression model. The simplest form of the model is:

Yi = B0 + B1Xi + ei,   i = 1...n

where:
• Yi = ith observation on the dependent variable
• Xi = ith observation on the independent variable
• ei = ith observation on the error term
• B0, B1 = the parameters to be estimated
• n = number of observations

As is often the case, a picture can save one a good deal of explaining. The data collected on variables Y and
X are presented in a scatter diagram below. Linear regression analysis identifies the equation for the
straight line that best captures the 'flavor' of the scatter. More specifically, the regression procedure
specifies the values of the parameters B0 and B1 so that we have a specific equation, which will allow us to
calculate the 'average' value of Y [AVG(Y)] given the value of X. What remains unexplained by the
equation is captured in the error term. In the diagram below, the actual value of Y for the ath observation is
Ya while the model estimates AVG(Ya) as the value for Y. The difference between these two is the error
term.

Bi-variate Regression: The Graphics

We can never expect a perfect fit with our model because there are always going to be some minor
influences on Y omitted in the specification of the model, human behavior will always contain an element
of randomness or unpredictability, the variables may not be measured correctly, and the model may not be
truly linear. We do, however, hope these problems are minor, and when they do surface, we can modify our
analysis in a number of ways to help correct them. In any event, as we will see later, the standard
linear regression model is designed to choose the values of the parameters in such a way as to minimize
the sum of squared errors. For example, in the diagram below it is obvious that the equation Y = B2 + B3X does not
adequately reflect the data and that the error terms would on average be larger. Stated somewhat
differently, the second equation does a much poorer job of representing the data.

Alternative Regression Equations

If this were the end of the story, it would be a short one. The fact is there are few, if any, instances where
the bivariate model is appropriate because there are few cases where the value of a dependent variable is
influenced by only one independent variable. It is more likely the dependent variable (Y) will be influenced
by a number of independent variables. In this case the linear regression model can be written as:

Yi = B0 + B1X1i + B2X2i + ... + BKXKi + ei,   i = 1...n

where:
• Yi = ith observation on the dependent variable
• Xji = ith observation on the jth independent variable
• ei = ith observation on the error term
• B0 ... BK = the parameters to be estimated
• K = the number of independent variables
• n = number of observations
It is also true there are many times when the linear model depicted above does not adequately reflect the
data. One possible alternative specification would be the exponential form:

Y = e^(a1) * X1^(b1) * X2^(b2) * e^(e)

If you believed this was the appropriate model, you would employ a logarithmic transformation, which
makes the equation linear in its parameters:

lnY = a1 + b1*lnX1 + b2*lnX2 + e

In the case of the exponential model, the sign and size of the estimated coefficients have a significant
impact on the 'picture' of the relationship. The graph below shows the relationship between Y and X1 for
different values of the parameter b1. When b1 > 1 we have the familiar parabola and when b1 < 0 we have the
hyperbola. If a scatter diagram had one of these shapes, it would be appropriate to use the exponential
function.
Exponential Function

An alternative specification would be the semi-log equation. Two possibilities would be the equations:

Y = a1 + b1*lnX1 + b2*X2 + e

lnY = b0 + b1*X1 + b2*X2 + e

Pictures of these are presented below. While both are legitimate equations, neither is frequently
used. The reason is that there is seldom, if ever, a compelling reason to use either of them, since
simpler specifications with easier interpretations can be used which have similar 'pictures'. It is also a bit of
work to calculate the elasticities, which are often the primary concern of the researcher.
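
One attraction of the log-log (exponential) form is that its slope coefficients are elasticities, the quantity researchers usually care about most. The sketch below, using invented price and quantity data, estimates a log-log demand equation; the coefficient on ln(price) is read directly as the price elasticity of demand.

import numpy as np
import statsmodels.api as sm

# Hypothetical demand data: prices and quantities sold.
price    = np.array([2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5])
quantity = np.array([120, 105, 95, 88, 80, 74, 70, 66], dtype=float)

# Log-log specification: ln(Q) = a + b*ln(P) + e, so b is the price elasticity.
X = sm.add_constant(np.log(price))
results = sm.OLS(np.log(quantity), X).fit()
print(results.params[1])   # estimated elasticity; a value near -0.6 here would mean inelastic demand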

The list of alternative functional forms certainly extends beyond the few mentioned here. Some of the more
popular forms are polynomials and equations containing interaction terms and inverses. I would suggest
that your experimentation with these alternative forms be restricted to those times when all else fails. A
more advanced problem involves the specification of a model in which two or more of the variables are
mutually dependent. When analyzing any market for example, it is a safe bet that the quantity demanded
depends upon the market price just as the market price depends on the amount demanded. In this situation it
is important to construct a multi-equation model in which you estimate the parameters of all the equations
simultaneously. More will be said about this problem in the following section.

Interpretation
The task for the researcher at this point, after collection of the data and estimation of the equation, is the
interpretation of the results. A thorough analysis of the results will focus on the extent to which the model
adequately explains the dependent variable and the correspondence between the values of the estimated
parameters and a priori expectations. As for the first of these, the 'goodness of fit', we can return to the
simple bi-variate diagram. There are a number of measures of the 'goodness of fit', but the one which
economists tend to focus on is built up from the total sum of squares (SST), which is written as:

SST = Σ(Yi - Ȳ)²

where Ȳ is the mean of Y. By acknowledging that each observation can be written as Yi = Ŷi + ei, where Ŷi is
the value fitted by the regression and ei = Yi - Ŷi is the error, the equation can be rewritten as:

SST = Σ(Ŷi - Ȳ)² + Σei²

As is evident in the diagram below, the total sum of squares can be decomposed into two separate
components: the sum of squares of the differences between the fitted values and the mean of the dependent
variable (Ŷi - Ȳ), and the sum of squares of the residuals. The first of these terms, referred to as the
regression sum of squares (SSR), represents the amount of the deviation in Y from its mean that is
explained by the model. The unexplained deviations of Y from its mean are captured in the second term,
the error sum of squares (SSE).
Decomposition of Variance

This decomposition of variance provides us with the primary measure of 'goodness of fit', the measure of
how adequately the dependent variable is explained by the model. This measure, known as the coefficient
of determination, is simply defined as the ratio of the explained to the total sum of squares:

R2 = SSR/SST = 1 - SSE/SST = 1 - Σei²/Σ(Yi - Ȳ)²

It should be clear from this formulation that the R2 is bounded by 0 and 1. As the model's explanatory
power increases, the R2 approaches 1. On the other hand, as the scatter of points becomes more random and
the errors increase, the R2 approaches 0. One of the undesirable features of the R2 is the fact that it can never
decrease as more independent variables are added. The difficulty with the addition of independent
variables is that it reduces the degrees of freedom (n-K-1), the difference between the number of
observations and the number of estimated parameters. Any decrease in the degrees of freedom will result
in a loss in the reliability of the model. For this reason, the adjusted R2 has been created. The adjusted R2 is
defined as:

Adjusted R2 = 1 - (SSE/(n-K-1))/(SST/(n-1))

The coefficient of determination gives us a good 'gut feeling' for the goodness of fit, but it possesses no
statistical properties. It can, however, be slightly modified to allow us a 'statistical' test of the goodness of
fit. As you will find in your statistics book, the decomposition can be reformulated to derive the F-ratio:

F = (SSR/K)/(SSE/(n-K-1))

This is simply the ratio of the explained to the unexplained sum of squares adjusted for the number of
regressors and the number of observations. Unlike the coefficient of determination, the F-ratio has no upper
bound. For high values of F, we can be confident the model does an adequate job of explaining Y. Low
values of F, meanwhile, indicate the model is inadequate as an explanation for Y. The actual division
between the acceptance and rejection regions can be determined from the F table.
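
To make the decomposition concrete, here is a short sketch that computes SST, SSR, SSE, R2, the adjusted R2, and the F-ratio directly from the residuals of an OLS fit on invented data; statsmodels reports the same quantities automatically, so the manual calculations are only a check on the formulas above.

import numpy as np
import statsmodels.api as sm

# Invented data with two independent variables.
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=40), rng.normal(size=40)
Y = 1.0 + 0.5 * X1 - 0.3 * X2 + rng.normal(scale=0.5, size=40)

X = sm.add_constant(np.column_stack([X1, X2]))
fit = sm.OLS(Y, X).fit()

n, K = len(Y), 2
e   = Y - fit.fittedvalues                 # residuals
SST = np.sum((Y - Y.mean()) ** 2)          # total sum of squares
SSE = np.sum(e ** 2)                       # error (residual) sum of squares
SSR = SST - SSE                            # regression (explained) sum of squares

R2     = SSR / SST
adj_R2 = 1 - (SSE / (n - K - 1)) / (SST / (n - 1))
F      = (SSR / K) / (SSE / (n - K - 1))
print(R2, adj_R2, F)                       # match fit.rsquared, fit.rsquared_adj, fit.fvalue
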
Having assessed the overall explanatory power of the model, the researcher would turn to the individual
parameter estimates. Are the parameter estimates consistent with their predicted sign based on the
underlying economic theory? Do the parameter estimates suggest that the variable is important? Is there
reason to believe that the dependent variable is statistically related to the independent variable or is it
possible that the two are statistically unrelated?

The first two of these questions are generally easily answered. If you have estimated an equation explaining
investment expenditures and the coefficient of the interest rate is positive, the results of your analysis differ
significantly from the established theory. Similarly, if you estimated a consumption equation and found the
coefficient of income was .2, there would be reason to question the analysis. The parameter has the correct
sign, but it is sharply lower than our economic theory would suggest. In both cases, it would be time to look
very closely at the model and begin some diagnostic tests to determine what could be wrong. It would only
be after extensive testing and re-estimation that one would accept these results.

As for the question of statistical significance, the standard regression package will always provide a
nonzero estimate of the parameters. To help us determine whether there is enough evidence to accept the
hypothesis that the parameter is not zero, the t-test was designed. The t-statistic is defined as:

tk = Bk/sk

where Bk is the estimated parameter value for the kth independent variable and sk is the 'standard error' of
this coefficient. As with the F-statistic, a high value for t means there is reason to accept the model.
More specifically, a value for t above 2 can generally be accepted as evidence of a nonzero parameter
value and therefore evidence of a relationship between the kth independent variable and Y.

But can I really believe in my results? Not if you restrict your investigative 'research' to a casual study of
these summary statistics. There is a myriad of possible problems with your work, just as there is a wide
array of solutions to them. To better understand the problems one is likely to encounter when
estimating equations, it is necessary to go back to the beginning briefly. Most of the problems that surface
result from the fact that the data have shown that some of the assumptions made during the specification
of the model were inappropriate. It therefore seems appropriate to begin with a quick review of the
assumptions.

Classical Linear Model


To answer the question of believability, one must look at the properties of the estimators. Ordinary least
squares (OLS) is by far the most popular estimation technique. Its popularity stems from the fact it is
simple and it will provide the best linear unbiased estimators under certain assumptions. OLS's limitations,
meanwhile, are that the basic assumptions are quite restrictive and that, when they cannot be
justified, OLS is no longer the 'best' estimation technique available. If it turns out that some of
these assumptions do not apply to the problem, then one must consider alternative estimation techniques.

The assumptions that must be valid for OLS to have its desirable estimation properties form the structure of
what is called the Classical Model. The equation which we will be estimating is:

Yi = B0 + B1X1i + B2X2i + ... + BKXKi + ei,   i = 1...n


The classical assumptions are listed in the table below.

Assumptions of the Classical Model


• 1. The regression model is linear in the coefficients and the error term.
• 2. The error term has a zero mean.
• 3. The error terms are not serially correlated.
• 4. The error term has a constant variance.
• 5. The independent variables are not linearly related.
• 6. The independent variables and the error term are unrelated.
• 7. The error term is distributed normally.

The first of these assumptions is, as we have seen before, much less restrictive than it appears on the
surface. Linear regression refers to 'linear in the parameters' and can be applied to many nonlinear forms
such as the polynomial, log, semi-log, and exponential functions. The assumption of a zero mean in the
error term is also not very restrictive. In fact, as long as the equation has a constant term, this assumption is
satisfied. The final assumption is primarily of value in hypothesis testing. Referring back to the equation
above, you can envision that one of the goals of the analysis was to determine if the independent variable
X1 had an influence on Y. By assuming that the error term is normally distributed, the researcher can rely
on the t and F-statistics to test these hypotheses. Once again, there is reason to believe that this assumption
is not too restrictive. The basis for this optimism is the Central Limit Theorem that states:

The mean of a number of independent, identically distributed random variables will tend to be normally
distributed, regardless of their underlying distribution, as the number of random variables becomes large.

The implication of the assumption of normally distributed error terms is that the estimates of the parameters
are normally distributed because the estimators of the coefficients are simply linear functions of the
normally distributed error term. Stated somewhat differently, if we ran the above regression 100 times we
would generate 100 estimates of the parameter B1. Three possible distributions for the parameter estimates
of B1 are pictured in the graph below. These distributions can be identified by their means and variances. In
distribution #1, the mean of the parameter estimate equals the true parameter value and there is a small
variance in estimates about the mean. Distribution #2 has the same mean, but it has a higher variance, while
Distribution #3 has a large variance and a mean that is not equal to the true parameter value.

In a comparison of the estimation techniques that produced these distributions of parameter estimates,
Distribution #1 would clearly be the preferred choice. This distribution is the most likely to give us
an estimated parameter with a value close to the 'true' parameter. Distribution #2, meanwhile, will provide
us with an unbiased estimate of the parameter, but it is not as efficient as #1 because there is a greater chance
of being away from the true parameter value. As for Distribution #3, it is less desirable because it
provides biased parameter estimates. Even if we could reduce its variance, the parameter estimate does not
tend to center on the true value of the parameter.

So where does the OLS model fit in? What is the distribution of the estimators from this simple technique?
The answer to these questions is supplied by the Gauss-Markov Theorem, a centerpiece of all
econometrics texts. Very simply, the Gauss-Markov Theorem states: if assumptions 1-6 of the Classical
Model are valid, then the OLS estimate of B is BLUE, the best (minimum variance) linear unbiased
estimator of B.

Given this idealized world, we can now return briefly to the issue of hypothesis testing and the t-statistic
that is at the center of any such testing. For those who are rusty on this issue, you should return to your
statistics book to brush up. We begin with the specification of the hypothesis, or more specifically, with the
specification of the null and alternative hypotheses. The null hypothesis generally expresses the range of
values for the parameters under the assumption the theory was incorrect. For example, in a model of
investment expenditures, a null hypothesis could be that the coefficient of interest rates is zero, evidence
that interest rates do not influence total expenditures. The alternative hypothesis expresses the range of
values likely to occur if the researcher's theory is correct. In this case the alternative hypothesis would be
that the parameter is not equal to zero.

When testing hypotheses, there are two types of errors, referred to as Type I and Type II errors. A Type I
error is made when we reject a null hypothesis that is actually true. A Type II error occurs
when we fail to reject a null hypothesis that is false. A graphical representation of the errors is presented in the
following probability distribution diagrams of the parameter estimates. In the left hand diagram, the true
parameter is zero and our null hypothesis is B=0, but we end up with an estimate in the far right tail and we
reject the null hypothesis - a Type I error. In the right hand diagram, the null hypothesis is once again B=0, but here the
true parameter value is 1. If we happen to end up with a parameter estimate near zero, we may fail to reject the
null hypothesis when in fact it is false - a Type II error.

Having developed the hypotheses, one must establish a decision rule which will specify the acceptable and
unacceptable ranges of some sample statistic. In the case of the regression coefficient, there is a need to
specify the critical value of the parameter estimate which separates the accept from the reject regions. What
choice one makes depends to a large extent on the relative costs of TYPE I and TYPE II errors because
generally any choice which will reduce the probability of one error will increase the probability of the
other. A graphical representation of the specification of the critical value is presented below.

The top graph refers to the situation where we are dealing with a one-tailed test, such as the situation when
we have Ho: B<=0. In this case we have specified a critical value (Bc) so that a value greater than that for
B will lead us to reject the null hypothesis. In the lower graph the two-tailed test is depicted. In this case the
null hypothesis would be, Ho: B=0. The critical value is chosen so that any values of B greater than +Bc or
less than -Bc will result in a rejection of the null hypothesis.

How is it that these critical values are chosen? Most econometricians use the t-test to test hypotheses
concerning individual parameters. Returning to the typical regression equation:

Yi = B0 + B1X1i + B2X2i + ... + BKXKi + ei,   i = 1...n

the t-value for the ith coefficient can be calculated as:


ti = (Bi - Bhi)/s(Bi)

where:
• Bi = estimated regression coefficient
• Bhi = border value implied by null hypothesis
• s(Bi) = estimated standard error of Bi

The value of the t-statistic for the parameter can then be compared with the critical value for the t-statistic
that can be obtained from the tables at the end of nearly any statistics book. The choice of a critical value
depends on the level of significance, or level of confidence, that you want in your study and on the degrees of
freedom, which depend upon the number of observations and the number of estimated parameters. Once the
critical value is chosen, the rule for the hypothesis test is:

Reject Ho if |ti| > tc

When reading empirical work in economics you are likely to see the use of 1, 5, and 10 percent levels of
significance. The specification of a 5 percent level of significance means that the probability of observing a
value for the t-statistic that is greater than the critical value is 5 percent if the null hypothesis were correct.
If we turn this around, we could state that we have a 95 percent level of confidence, that the results are
statistically significant at the 95 percent level. The table below provides a guide to the critical values for the
t-statistics when there are 30 degrees of freedom.

Critical Values for the t-statistic (30 degrees of freedom)

Significance Level    One-tailed    Two-tailed
1 percent                2.457         2.750
5 percent                1.697         2.042
10 percent               1.310         1.697

What about the special case in which the null hypothesis is Ho: B=0? In this situation the border value for
the null hypothesis is 0 and the t-statistic reduces to the ratio of the coefficient to the estimated standard
error of the parameter.
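
The sketch below illustrates this decision rule, looking up the critical value with scipy rather than a printed table. The coefficient, standard error, and sample size are hypothetical numbers chosen only to mirror the earlier example.

from scipy import stats

b, se = 0.68, 0.094          # hypothetical coefficient and its standard error
n, K  = 48, 1                # observations and number of independent variables
df    = n - K - 1            # degrees of freedom

t_stat = b / se              # border value under Ho: B = 0 is zero, so t is just B over its standard error
t_crit = stats.t.ppf(1 - 0.05 / 2, df)     # two-tailed critical value at the 5 percent significance level

print(t_stat, t_crit)
print("reject Ho" if abs(t_stat) > t_crit else "do not reject Ho")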

Limitations of Classical Linear Model


So much for the good news. It is assumptions 3 through 6 that prove to be the most difficult to accept in
most regression analyses. What problems with the parameter estimates are caused by dropping these
assumptions, and how can the basic OLS estimation technique be altered to provide unbiased, efficient
parameter estimates? As you will see, there are some simple tests that can be invoked to 'test' these
assumptions, as well as some features of the regression output that suggest a problem. Fortunately, once the
problems have been diagnosed, there are some solutions. It is precisely these diagnostics and solutions
that will be the focus of the following section. Please note that this will be a very brief presentation and
that you are encouraged to explore these issues more fully in any econometrics or linear regression text.

Serial Correlation
What is it?
The third assumption of the Classical Model is there are no interdependencies between the errors of the
separate observations. Stated somewhat differently, knowledge of the error for any one observation should
not help us anticipate the error for the next observation. First-order serial correlation is the most common
form of the problem, but certainly not the only form. When it does exist, we can describe the relationship
between the error terms by the equation:

et = r*et-1 + ut

where:
• et = the error term
• r = the parameter describing the linkage (-1 < r < +1)
• ut = a classical error term
In the diagrams below, examples of positive and negative serial correlation are presented. In the left hand
side diagram, a positive error for any observation is a good indicator that the next error will be positive
while in the right hand diagram we find that a positive error is a good indicator that the error in the next
observation will be negative.

Why do we Care?

Serial correlation's primary impact on the regression is that it reduces the efficiency of OLS estimators. The
problem is the pattern of serial correlation is likely to be assigned by OLS to the effect of one of the
independent variables. In this sense OLS is more likely to misjudge the correct parameter value, a situation
reflected in a higher variance in the distribution of the parameter estimates.

What do we Do?

The best treatment of the serial correlation problem depends upon the likely source of the problem. One
possibility is an incorrect specification of the functional form. The left hand side diagram below depicts a
situation where the 'true' relationship is nonlinear, but where a linear equation has been estimated. In this
situation the errors for all observations below t1 and above t2 would be positive while they would be
negative for the values of t between these two. In this case the appropriate correction would be to estimate
a different functional form.

Serial correlation can also surface because an important independent variable has been omitted. In the right
hand diagram a variable with a strong cyclical component has been estimated by a linear equation. An
example of where one might encounter this would be a model of retail sales based on quarterly data. In this
situation, fourth quarter sales can be expected to be 'abnormally' high and this information should be built
into the model. One possible correction would be to include a dummy variable to capture the seasonal
effect. Another would be to use seasonally adjusted data.

A third possibility is that we have some serial correlation that can not be traced to inappropriate functional
form or missing variables. There are many possible types of serial correlation. The error in the current
period could be related to the error two or four periods ago, or it could be related to some combination of
previous errors. The most likely situation, and the one most closely studied, is first-order serial correlation.
The most widely used test to detect the existence of serial correlation is the Durbin-Watson d Statistic. The
Durbin-Watson Statistic varies from 0 to 4. A value of 0 indicates extreme positive serial correlation (et =
et-1). On the other end of the spectrum would be extreme negative serial correlation (et = -et-1) which would
have a d statistic of 4. If there were no serial correlation, the Durbin-Watson d statistic would approach 2.

The output from a regression package should include the Durbin-Watson Statistic. Armed with the d
statistic and the number of observations and independent variables, one turns to the table containing the
Durbin-Watson Statistic. Once here, you will note a difference from the t and F statistic tables. There are
two values for d, dL and dU. If the regression d value falls below dL, the null hypothesis can be rejected
while a value greater than dU means that the null hypothesis should not be rejected. If, however, the d
statistic falls between these two numbers, the test is inconclusive.
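
If your regression output does not report the statistic, it is easy to compute from the OLS residuals. The sketch below simulates a series with first-order serial correlation, computes the d statistic by hand, and checks it against the durbin_watson helper in statsmodels; all numbers are invented.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Invented time series whose errors follow e(t) = 0.7*e(t-1) + u(t).
rng = np.random.default_rng(1)
x = np.arange(60, dtype=float)
e = np.zeros(60)
for t in range(1, 60):
    e[t] = 0.7 * e[t - 1] + rng.normal()
y = 2.0 + 0.5 * x + e

fit = sm.OLS(y, sm.add_constant(x)).fit()
res = fit.resid

d = np.sum(np.diff(res) ** 2) / np.sum(res ** 2)   # Durbin-Watson d statistic
print(d, durbin_watson(res))                        # values well below 2 point to positive serial correlation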

If we get back results which indicate the existence of serial correlation, the OLS model should be
adjusted. The appropriate adjustment would entail the use of the Generalized Least Squares (GLS) model as
the estimation technique. The GLS model combines the information contained in the following two
equations:

Yt = B0 + B1*X1t + et
et = r*et-1 + ut

The first equation is the linear model that contains the serial correlation. The second equation describes the
first-order serial correlation. Combining these two equations we get:

Yt = B0 + B1*X1t + r*et-1 + ut

The problem with this equation is the term r*et-1. To eliminate this term we can use the equation for r*Yt-1:

r*Yt-1 = r*B0 + r*B1*X1t-1 + r*et-1

If we now subtract this from the equation for Yt we get:

Yt - r*Yt-1 = (1-r)*B0 + B1*(X1t - r*X1t-1) + ut

OK, so how do we get an estimate of the serial correlation coefficient r needed for GLS? One possibility is to use
the rough approximation r = 1 - d/2. A more sophisticated method would be the Cochrane-Orcutt iterative
technique. The first step requires calculation of the error terms from the OLS regression output. These errors
are used to estimate the equation et = r*et-1 + ut. With the estimate of the value of r, the GLS model can be
estimated. The residuals from this regression are then used to re-estimate the equation et = r*et-1 + ut,
and the entire process continues until the value of r changes very little from one round of the iterative
process to the next. If you do not like this approach, you could try other methods, including Hildreth-Lu, Theil-
Nagar, and Durbin, but you are on your own in this venture.
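
A hand-rolled version of the Cochrane-Orcutt iteration for the bivariate case is sketched below with invented data. It is meant only to show the mechanics; packaged routines (statsmodels' GLSAR, for instance) automate the same idea and should be preferred in real work.

import numpy as np
import statsmodels.api as sm

# Invented data: y(t) = 1 + 0.4*x(t) + e(t), with e(t) = 0.6*e(t-1) + u(t).
rng = np.random.default_rng(2)
x = np.arange(80, dtype=float)
e = np.zeros(80)
for t in range(1, 80):
    e[t] = 0.6 * e[t - 1] + rng.normal()
y = 1.0 + 0.4 * x + e

res = sm.OLS(y, sm.add_constant(x)).fit().resid    # step 0: residuals from the initial OLS fit
r = 0.0
for _ in range(20):
    r_new = sm.OLS(res[1:], res[:-1]).fit().params[0]   # estimate r from e(t) = r*e(t-1) + u(t)
    y_star = y[1:] - r_new * y[:-1]                      # quasi-difference the data ...
    x_star = x[1:] - r_new * x[:-1]
    gls = sm.OLS(y_star, sm.add_constant(x_star)).fit()  # ... and re-estimate the transformed equation
    b0 = gls.params[0] / (1 - r_new)                     # recover the original intercept
    b1 = gls.params[1]
    res = y - b0 - b1 * x                                # residuals in terms of the original data
    if abs(r_new - r) < 1e-4:                            # stop when r settles down
        break
    r = r_new

print(r_new, b0, b1)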

Heteroskedasticity
What is it?

The fourth assumption of the Classical Model is the error terms have a constant variance. As with serial
correlation, there can be many versions of the heteroskedasticity problem and a thorough treatment of all
variations of the problem is well beyond the scope of our present work. Here we will focus our attention on
the form of the problem which you are most likely to encounter. In this version of the problem, which is
demonstrated in the diagram below, the variance in the error term seems to be increasing as the value of X
increases. This is likely to be a problem when the data set contains a dependent variable that has a
significant range in its values or there has been a significant change in the data collection procedure.

Why do we Care?

Heteroskedasticity's primary impact on the regression is that it reduces the efficiency of OLS estimators.
The problem is that the pattern of heteroskedasticity is likely to be assigned by OLS to the effect of one of
the independent variables. In this sense OLS is more likely to misjudge the correct parameter value, a
situation reflected in a higher variance in the distribution of the parameter estimates.

What do we Do?

The variety of forms heteroskedasticity can take translates directly into a wide array of methods for
testing for its existence. Before mentioning the formal approaches to detection and correction, it is worth
noting that a careful selection of the dependent and independent variables can greatly reduce the likelihood
of a problem. If, however, there is a need for a formal test, two widely recognized tests would be the Park
and Goldfeld-Quandt tests. Both of these methods are designed to test for the existence of a proportionality
factor, and for both you have to have some notion of what the proportionality factor Z is. In the
Park test you take the residuals from the OLS regression and run a regression in which you estimate the
natural log of the squared residuals as a function of the natural log of the proportionality factor
[ln(ei²) = a0 + a1*lnZi + ui]. The coefficient of Z is then tested using the simple t-statistic.

In the Goldfeld-Quandt Test, meanwhile, the sample is ordered by the proportionality factor and the
variance of the residuals is calculated for the lowest and highest third of the sample. The ratio of the
residual sum of squares is formed (RSS3/RSS1) and an F-test is performed to test the hypothesis that this
ratio differs significantly from 1.
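
A sketch of the Goldfeld-Quandt procedure with invented data is given below: the sample is sorted by the suspected proportionality factor Z, the lowest and highest thirds are fitted separately, and the ratio of their residual sums of squares is compared against the F distribution.

import numpy as np
import statsmodels.api as sm
from scipy import stats

# Invented cross-section in which the error variance grows with Z (here Z is also the regressor).
rng = np.random.default_rng(3)
Z = rng.uniform(1, 10, size=90)
Y = 2.0 + 1.5 * Z + rng.normal(scale=0.4 * Z)      # heteroskedastic errors

order = np.argsort(Z)                              # order the sample by the proportionality factor
Zs, Ys = Z[order], Y[order]
n = len(Zs)
lo, hi = slice(0, n // 3), slice(2 * n // 3, n)    # lowest and highest thirds of the ordered sample

fit_lo = sm.OLS(Ys[lo], sm.add_constant(Zs[lo])).fit()
fit_hi = sm.OLS(Ys[hi], sm.add_constant(Zs[hi])).fit()

ratio = fit_hi.ssr / fit_lo.ssr                    # RSS3 / RSS1
df = (n // 3) - 2                                  # observations minus estimated parameters in each third
print(ratio, stats.f.sf(ratio, df, df))            # a small p-value is evidence of heteroskedasticity
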
Once the problem has been diagnosed, the solution is to use weighted least squares. The entire data set is
divided by the proportionality factor, Z:

Yt/Zt = B0/Zt + B1*X1t/Zt + B2*X2t/Zt + ut

This is the easy part. From here on you must be quite careful in your interpretation of your results. First,
you can note that there are two distinct possibilities: Z is one of the independent variables or it is not. In the
case that it is not one of the Xs, this weighting will eliminate the constant term. To avoid the problems
associated with this, a constant term could be added to the equation. In the case that Z = X2, there will be a
constant term because X2/Z = 1. The equation would be:

Yt/Zt = B0/Zt + B1*X1t/Zt + B2 + ut

The only problem here is that the coefficients are a bit more difficult to interpret. The parameter linking
Y and X2 (B2) is now the intercept of the transformed equation, and the original intercept (B0) is now the
coefficient of the variable 1/Zt.
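
The sketch below works through this weighting for invented data in the case Z = X2: every variable is divided by Z, the regression is run on the transformed data, and the intercept of the transformed equation is read as the coefficient of X2 in the original model.

import numpy as np
import statsmodels.api as sm

# Invented model Y = B0 + B1*X1 + B2*X2 + e, with the error standard deviation proportional to X2.
rng = np.random.default_rng(4)
X1 = rng.uniform(0, 5, size=60)
X2 = rng.uniform(1, 10, size=60)
Y  = 3.0 + 2.0 * X1 + 1.5 * X2 + rng.normal(scale=0.5 * X2)

Z = X2                                     # the suspected proportionality factor
Yw   = Y / Z
invZ = 1.0 / Z                             # the original intercept B0 becomes the coefficient of 1/Z
X1w  = X1 / Z                              # X2/Z is a column of ones, so B2 becomes the new intercept

Xw = sm.add_constant(np.column_stack([invZ, X1w]))
wls = sm.OLS(Yw, Xw).fit()
print(wls.params)                          # [B2 (intercept), B0 (coef of 1/Z), B1]; compare with 1.5, 3.0, 2.0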

Multicollinearity

What is it?
The fifth assumption of the Classical Model is that one independent variable is not a linear combination of
the other independent variables. This, like serial correlation, is a problem most likely to be encountered by
the researcher working with time-series data. While it is true that the extreme case is quite rare, less
extreme cases of multicollinearity often complicate the work of the applied researcher. To better understand
the problem, consider the situation when two independent variables are highly correlated. The regression
package searches for a relationship between the dependent and independent variables. As the dependent
variable changes, it looks for similar patterns of change in the independent variables. For example, if the
value of Y changed by 2 every time X changed by 4, the coefficient of X should approach 1/2.

In a situation where we have multicollinearity, however, it becomes difficult to isolate this relationship. For
example, if two independent variables X and Z are highly correlated, they will both tend to change in
similar patterns. The regression package will not be able to disentangle the relationship between Y and X
from the relationship between Y and Z. Is it the change in Z or the change in X that is responsible for the
change in Y? When we have multicollinearity there is no way to know.

Why do we Care?

Multicollinearity's primary impact on the regression is that it increases the variance of the OLS estimators.
By increasing the variance and the standard error, multicollinearity decreases the computed t scores. More
importantly, the difficulty of isolating the unique contribution of each explanatory variable means that the
estimated parameters are extremely sensitive to the specification of the model. The deletion or addition of
explanatory variables can dramatically alter the parameter estimates and their significance levels. A second
indicator of the problem would be a high R2 value with insignificant t-statistics. In this case the regression
model has been able to explain the variation in Y; it simply does not know whether to credit X or Z.

What do we Do?

As indicated above, there are some very good indicators of the existence of multicollinearity. If the t-
statistics and the parameter estimates are quite sensitive to the specification of the equation, or the equation
has low t scores for its parameter estimates but a high R2, then the problem exists. The question at that
point is what to do. One possibility that should be seriously considered is to do nothing. A high correlation
between two or more independent variables is not a guarantee of trouble with the regression. Also, the
elimination of a 'theoretically' important variable could result in biased parameter estimates because of
misspecification.

If ignoring the problem does not suit you, then you could always remove the problem variable(s), or you
could attempt some transformation of the independent variables. Two popular transformations would be to
form a linear combination of the variables or to transform the equation into first differences.
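
There is no single definitive test, but a few simple diagnostics, sketched below with invented data, make the symptoms visible: the pairwise correlation between the regressors, a variance inflation factor (VIF) for each one (a conventional supplement not discussed above), and the combination of a high R2 with imprecise individual coefficients.

import numpy as np
import statsmodels.api as sm

# Invented regressors that move almost in lockstep.
rng = np.random.default_rng(5)
X = rng.normal(size=100)
Z = X + rng.normal(scale=0.05, size=100)
Y = 1.0 + 2.0 * X + 0.5 * Z + rng.normal(size=100)

print(np.corrcoef(X, Z)[0, 1])             # pairwise correlation close to 1

# Variance inflation factor: regress each regressor on the others; VIF = 1 / (1 - R2).
def vif(target, others):
    r2 = sm.OLS(target, sm.add_constant(others)).fit().rsquared
    return 1.0 / (1.0 - r2)

print(vif(X, Z), vif(Z, X))                # large values (a common rule of thumb is above 10) signal trouble

fit = sm.OLS(Y, sm.add_constant(np.column_stack([X, Z]))).fit()
print(fit.rsquared, fit.tvalues)           # high R2 but low t scores on the individual coefficients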

Simultaneous Equations
What is it?

If you really want to do a good job of estimating an empirical relationship, then there is one more potential
problem you must address. Unfortunately, it is not a small problem that is encountered only infrequently and easily
solved. To understand the problem, consider your favorite theoretical model, supply and demand. In
specifying a demand equation, the price would certainly be one of the right-hand-side variables. Could you
make the case, however, that the price is unaffected by the level of demand? Except in extreme cases, this
would be an unreasonable assumption, and the single-equation OLS model, which ignores this possibility, is
an inappropriate estimation technique. Estimating the equation with no explicit treatment of
the interdependency will violate the sixth assumption of the Classical Model, the assumption that the error
and the right-hand-side variables are uncorrelated. This results in biased estimates of the parameters.

The nature of the problem can be seen in the left hand scatter diagram below. The researcher has
accumulated the appropriate price and quantity data to estimate a demand curve to determine the price
elasticity of demand. If the model P = b0 + b1*Q is estimated by OLS, the result would be the heavy dashed
line. Based on this one would conclude that demand was inelastic. The problem is that we 'know' that the
true supply and demand curves are given by the solid lines. What we are observing are equilibrium points,
points on or near the intersection of shifting supply and demand curves, rather than points along a
stationary demand curve. The secret to solving the simultaneous equations problem is to design a model
which allows one to identify the shifts in the curves.

Simultaneous Equation Bias and the Identification Problem

In the right hand diagram, the demand curve can be estimated if we know it is the supply curve that has
been shifting in response to some external shocks. The scatter of points is viewed as points on a stable
demand curve and a fluctuating supply curve. Similarly, if it was the demand curve that was shifting, then
the data would allow us to estimate the supply curve. If, however, we assumed both the supply and demand
curves responded to the same external factors, then it would be impossible for us to isolate the two separate
influences and we would not be able to generate estimates of the curves. This is referred to as the
identification problem. The solution to the identification problem is to have at least one independent
variable in each equation that is not in the other equations. The two-equation model below is an example of
a model that is identified.

Qd = b0 + b1*P + b2*Z + e
Qs = a0 + a1*P + a2*W + u

What do we Do?

To better understand the solution to the problem, it is useful to differentiate the structural and reduced
forms of the model. A simple structural model is:

Y1 = b0 + b1*Y2 + b2*Z + e
Y2 = a0 + a1*Y1 + a2*W + u

where the Ys are the dependent variables and Z and W represent predetermined variables. In this
form of the model we find dependent variables on both sides of the equations. To see the problem,
assume that the error in the Y1 equation is large. A large error will make Y1 large and, because Y1
is an explanatory variable in the Y2 equation, it will alter the size of Y2; since Y2 in turn appears on the
right-hand side of the Y1 equation, an explanatory variable ends up correlated with the error term, a
violation of the assumed independence of the explanatory variables and the errors.

This structural model can be transformed, with the use of some elementary algebra, into its reduced form.
The reduced form of a model is actually the solution of the model.

Y1 = B0 + B1*Z + B2*W + v1
Y2 = A0 + A1*Z + A2*W + v2

where the new error terms v1 and v2 are combinations of the structural errors e and u. The equations in the
reduced form specify the dependent variables as functions of the predetermined variables only. There are no
dependent variables on the right-hand side of the equations, and therefore OLS can be used without the
problems encountered when estimating the structural form. Unfortunately, it is not always possible to solve
for the original structural parameters (the a's and b's) from the estimates of the reduced form parameters
(the A's and B's).


One of many possible solutions to the problem is the use of Two-Stage Least Squares (2SLS). In
the first stage, the reduced form equation is estimated for each dependent variable that also appears as a right-
side variable in the structural model; this produces fitted values of these dependent variables as functions of
the predetermined variables only. In stage two, these fitted values are substituted for the dependent variables
on the right side of the structural model and an OLS regression is estimated.
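
The sketch below carries out the two stages by hand for the demand-and-supply system written earlier, using simulated data in which price and quantity are jointly determined and Z and W are the exogenous shifters. It is only an illustration of the mechanics; a packaged routine (for example, IV2SLS in the linearmodels library) would also produce correctly adjusted standard errors, which the naive second stage below does not.

import numpy as np
import statsmodels.api as sm

# Simulated system:  Qd = b0 + b1*P + b2*Z + e   and   Qs = a0 + a1*P + a2*W + u.
rng = np.random.default_rng(6)
n = 200
Z, W = rng.normal(size=n), rng.normal(size=n)
e, u = rng.normal(size=n), rng.normal(size=n)
b0, b1, b2 = 10.0, -1.0, 1.0
a0, a1, a2 = 2.0, 1.5, 1.0

# Equilibrium (Qd = Qs) solved for the observed price and quantity.
P = (b0 - a0 + b2 * Z - a2 * W + e - u) / (a1 - b1)
Q = a0 + a1 * P + a2 * W + u

# Stage 1: regress the endogenous right-side variable P on all the predetermined variables.
stage1 = sm.OLS(P, sm.add_constant(np.column_stack([Z, W]))).fit()
P_hat = stage1.fittedvalues

# Stage 2: estimate the demand equation with the fitted values substituted for P.
stage2 = sm.OLS(Q, sm.add_constant(np.column_stack([P_hat, Z]))).fit()
print(stage2.params)     # close to the true demand parameters (10, -1, 1)

# Naive OLS on the observed price shows the simultaneity bias.
print(sm.OLS(Q, sm.add_constant(np.column_stack([P, Z]))).fit().params)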
