Bivariate CLRM
1. (a) The use of vertical rather than horizontal distances relates to the idea
that the explanatory variable, x, is fixed in repeated samples, so what the
model tries to do is to fit the most appropriate value of y using the model for a
given value of x. Taking horizontal distances would have suggested that we
had fixed the value of y and tried to find the appropriate values of x.
(b) When we calculate the deviations of the points, yt, from the fitted
values, ŷt, some points will lie above the line (yt > ŷt) and some will lie below
the line (yt < ŷt). When we calculate the residuals (ût = yt − ŷt), those
corresponding to points above the line will be positive and those below the line
negative, so adding them would mean that they would largely cancel out. In
fact, we could fit an infinite number of lines with a zero average residual. By
squaring the residuals before summing them, we ensure that they all
contribute to the measure of loss and that they do not cancel. It is then
possible to define unique (ordinary least squares) estimates of the intercept
and slope.
(c) Taking the absolute values of the residuals and minimising their sum
would certainly also get around the problem of positive and negative residuals
cancelling. However, the absolute value function is much harder to work with
than a square. Squared terms are easy to differentiate, so it is simple to find
analytical formulae for the mean and the variance.
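The closed-form estimates that come out of minimising the squared residuals can be sketched as follows (the data values here are made up purely for illustration):

```python
import numpy as np

# Minimal sketch: the OLS estimates obtained by minimising the sum of
# squared residuals (the data values are assumed, purely illustrative).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Setting the derivatives of sum((y - a - b*x)^2) to zero gives:
b_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a_hat = y.mean() - b_hat * x.mean()
residuals = y - a_hat - b_hat * x

print(round(float(residuals.sum()), 10))   # residuals sum to (near) zero
ssr = np.sum(residuals ** 2)
for db in (-0.01, 0.01):                   # perturbing the slope raises the loss
    assert np.sum((y - a_hat - (b_hat + db) * x) ** 2) > ssr
print("squared loss is minimised at the closed-form estimates")
```

Note that the fitted line always has residuals summing to zero, which is exactly why an average-residual criterion cannot pin down a unique line.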
2. The population regression function (PRF) is a description of the model that
is thought to be generating the actual data and it represents the true
relationship between the variables. The population regression function is also
known as the data generating process (DGP). The PRF embodies the true
values of α and β, and for the bivariate model, could be expressed as

yt = α + βxt + ut

The sample regression function (SRF) is the corresponding relationship
estimated using the sample observations, and can be written as

ŷt = α̂ + β̂xt
Notice that there is no error or residual term in the equation for the SRF: all
this equation states is that given a particular value of x, multiplying it by β̂
and adding α̂ will give the model fitted or expected value for y, denoted ŷ. It
is also possible to write
yt = α̂ + β̂xt + ût
This equation splits the observed value of y into two components: the fitted
value from the model, and a residual term. The SRF is used to infer likely
values of the PRF. That is, the estimates α̂ and β̂ are constructed from the
sample data.
3. An estimator is simply a formula that is used to calculate the estimates, i.e.
the parameters that describe the relationship between the dependent variable
and one or more explanatory variables. There are an infinite number of
possible estimators; OLS is one choice that many people would consider a
good one. We can say that the OLS estimator is "best", i.e. that it has the
lowest variance among the
class of linear unbiased estimators. So it is optimal in the sense that no other
linear, unbiased estimator would have a smaller sampling variance. We could
define an estimator with a lower sampling variance than the OLS estimator,
but it would either be nonlinear or biased or both! So there is a tradeoff
between bias and variance in the choice of the estimator.
The test statistic is given by

test stat = (β̂ − β*) / SE(β̂) = (1.147 − 1) / 0.0548 = 2.682
We want to compare this with a value from the t-table with T−2 degrees of
freedom, where T is the sample size, and here T−2 = 60. We want a value with
5% all in one tail since we are doing a one-sided test. The critical t-value from
the t-table is 1.671:
[Figure: t-distribution f(x) with the 5% rejection region in the upper tail, critical value +1.671]
The value of the test statistic is in the rejection region and hence we can reject
the null hypothesis. We have statistically significant evidence that this security
has a beta greater than one, i.e. it is significantly more risky than the market
as a whole.
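A quick sketch of this test in code, using the estimate and standard error quoted above (scipy's t distribution supplies the critical value):

```python
from scipy import stats

# Sketch of the test above: H0: beta = 1 vs H1: beta > 1, using the
# estimate and standard error quoted in the text.
beta_hat, se_beta, beta_null = 1.147, 0.0548, 1.0
t_stat = (beta_hat - beta_null) / se_beta
t_crit = stats.t.ppf(0.95, df=60)   # one-sided 5%, T - 2 = 60 df

print(round(t_stat, 3), round(float(t_crit), 3))   # about 2.682 and 1.671
assert t_stat > t_crit   # reject H0: the security's beta exceeds one
```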
7. We want to use a two-sided test to test the null hypothesis that shares in
Chris Mining are completely unrelated to movements in the market as a
whole. In other words, the value of beta in the regression model would be zero
so that whatever happens to the value of the market proxy, Chris Mining
would be completely unaffected by it.
The null and alternative hypotheses are therefore:
H0: β = 0
H1: β ≠ 0
The test statistic has the same format as before, and is given by:
test stat = (β̂ − β*) / SE(β̂) = (0.214 − 0) / 0.186 = 1.150
We want to find a value from the t-tables for a variable with 38−2 = 36 degrees
of freedom, and we want to look up the value that puts 2.5% of the distribution
in each tail, since we are doing a two-sided test and we want to have a 5% size
of test overall:
[Figure: t-distribution with 2.5% in each tail, critical values −2.03 and +2.03]

The test statistic of 1.150 is smaller than the critical value of 2.03, so we do
not reject the null hypothesis. The confidence interval is given by

(β̂ − SE(β̂) × t_crit , β̂ + SE(β̂) × t_crit)
Confidence intervals are almost invariably two-sided, unless we are told
otherwise (which we are not here), so we want to look up the values which put
2.5% in the upper tail and 0.5% in the upper tail for the 95% and 99%
confidence intervals respectively. The 0.5% critical values are given as follows
for a t-distribution with T−2 = 38−2 = 36 degrees of freedom:
[Figure: t-distribution with 0.5% in each tail, critical values −2.72 and +2.72]
F critical value: 4.35, 4.08, 4.00, 3.92
t critical value: 2.09, 2.02, 2.00, 1.98
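The pairing of values in the table reflects the fact that, for a single restriction, the 5% F critical value with (1, df) degrees of freedom equals the square of the two-sided 5% t critical value. A quick check (the df values 20, 40, 60 and 120 are assumptions consistent with the numbers above, since the table's sample-size labels were lost):

```python
from scipy import stats

# For a single restriction, F(1, df) critical value = (two-sided t)^2.
# The df values below are assumed; the table's labels did not survive.
for df in (20, 40, 60, 120):
    t_crit = stats.t.ppf(0.975, df)     # two-sided 5% t critical value
    f_crit = stats.f.ppf(0.95, 1, df)   # 5% F critical value, 1 restriction
    print(df, round(float(t_crit), 2), round(float(f_crit), 2))
    assert abs(t_crit ** 2 - f_crit) < 1e-6
```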
2. (a) H0: β3 = 2
We could use an F- or a t-test for this one since it is a single hypothesis
involving only one coefficient. We would probably in practice use a t-test since
it is computationally simpler and we only have to estimate one regression.
There is one restriction.
(b) H0: β3 + β4 = 1
Since this involves more than one coefficient, we should use an F-test. There is
one restriction.
(c) H0: β3 + β4 = 1 and β5 = 1
Since we are testing more than one hypothesis simultaneously, we would use
an F-test. There are 2 restrictions.
(d) H0: β2 = 0 and β3 = 0 and β4 = 0 and β5 = 0
As for (c), we are testing multiple hypotheses so we cannot use a t-test. We
have 4 restrictions.
(e) H0: β2 β3 = 1
Although there is only one restriction, it is a multiplicative restriction. We
therefore cannot use a t-test or an F-test to test it. In fact we cannot test it at
all using the methodology that has been examined in this chapter.
3. The regression F-statistic would be given by the test statistic associated
with hypothesis (d) above. We are always interested in testing this hypothesis
since it tests whether all of the coefficients in the regression (except the
constant) are jointly insignificant. If they are then we have a completely
useless regression, where none of the variables that we have said influence y
actually do. So we would need to go back to the drawing board!
The t-ratios are given in the final row above, and are in italics. They are
calculated by dividing the coefficient estimate by its standard error. The
relevant value from the t-tables is for a two-sided test with 5% rejection overall.
T−k = 195; t_crit = 1.97. The null hypothesis is rejected at the 5% level if the
absolute value of the test statistic is greater than the critical value. We would
conclude based on this evidence that only firm size and market-to-book value
have a significant effect on stock returns.
If a stock's beta increases from 1 to 1.2, then we would expect the return on the
stock to FALL by (1.2 − 1) × 0.084 = 0.0168 = 1.68%.
This is not the sign we would have expected on beta, since beta would be
expected to be positively related to return, as investors require higher
returns as compensation for bearing higher market risk.
7. We would thus consider deleting the price/earnings and beta variables from
the regression since these are not significant in the regression, i.e. they are
not helping much to explain variations in y. We would not delete the constant
term from the regression even though it is insignificant since there are good
statistical reasons for its inclusion.
yt = β1 + β2 x2t + β3 x3t + β4 yt−1 + ut
Δyt = γ1 + γ2 x2t + γ3 x3t + γ4 yt−1 + vt
Note that we have not changed anything substantial between these models in
the sense that the second model is just a reparameterisation (rearrangement)
of the first, where we have subtracted yt−1 from both sides of the equation.
(a) Remember that the residual sum of squares is the sum of each of the
squared residuals. So let's consider what the residuals will be in each
case. For the first model, in the levels of y,

ût = yt − ŷt = yt − β̂1 − β̂2 x2t − β̂3 x3t − β̂4 yt−1
Now for the second model, the dependent variable is the change in y:

v̂t = Δyt − Δŷt = Δyt − γ̂1 − γ̂2 x2t − γ̂3 x3t − γ̂4 yt−1

where Δŷ is the fitted value in each case (note that we do not need at this stage
to assume they are the same). Rearranging this second model would give

v̂t = yt − yt−1 − γ̂1 − γ̂2 x2t − γ̂3 x3t − γ̂4 yt−1
   = yt − γ̂1 − γ̂2 x2t − γ̂3 x3t − (γ̂4 + 1) yt−1
If we compare this formulation with the one we calculated for the first model,
we can see that the residuals are exactly the same for the two models, with
γ̂4 = β̂4 − 1 and γ̂i = β̂i (i = 1, 2, 3). Hence if the residuals are the same, the
residual sum of squares must also be the same. In fact the two models are
really identical, since one is just a rearrangement of the other.
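This equivalence is easy to verify numerically; a sketch with simulated data (all parameter values and the seed are assumptions, purely for illustration):

```python
import numpy as np

# Sketch verifying part (a) on simulated data (all values assumed): the
# levels and first-difference regressions share the same regressors, so
# their residuals and RSS coincide, with gamma4 = beta4 - 1.
rng = np.random.default_rng(0)
T = 200
x2, x3 = rng.normal(size=T), rng.normal(size=T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.5 + 0.3 * x2[t] - 0.2 * x3[t] + 0.6 * y[t - 1] + rng.normal()

X = np.column_stack([np.ones(T - 1), x2[1:], x3[1:], y[:-1]])
beta, *_ = np.linalg.lstsq(X, y[1:], rcond=None)            # levels model
gamma, *_ = np.linalg.lstsq(X, y[1:] - y[:-1], rcond=None)  # differences

u = y[1:] - X @ beta
v = (y[1:] - y[:-1]) - X @ gamma
assert np.allclose(u, v)                    # identical residuals
assert np.isclose(gamma[3], beta[3] - 1.0)  # gamma4 = beta4 - 1
print("same residuals and RSS; only the R^2 denominator differs")
```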
(b) As for R², recall how we calculate R²:

R² = 1 − RSS / Σ(yi − ȳ)²

for the first model, and

R² = 1 − RSS / Σ(Δyi − Δȳ)²

in the second case. Therefore, since the total sum of squares (the
denominator) has changed, the value of R² must also have changed as a
consequence of changing the dependent variable.
(c) By the same logic, since the value of the adjusted R2 is just an algebraic
modification of R2 itself, the value of the adjusted R2 must also change.
8. A researcher estimates the following two econometric models:

yt = β1 + β2 x2t + β3 x3t + ut
yt = β1 + β2 x2t + β3 x3t + β4 x4t + vt
(a) The value of R2 will almost always be higher for the second model since it
has another variable added to the regression. The value of R2 would only be
identical for the two models in the very, very unlikely event that the estimated
coefficient on the x4t variable was exactly zero. Otherwise, the R2 must be
higher for the second model than the first.
(b) The value of the adjusted R2 could fall as we add another variable. The
reason for this is that the adjusted version of R2 has a correction for the loss of
degrees of freedom associated with adding another regressor into a regression.
This implies a penalty term, so that the value of the adjusted R2 will only rise if
the increase in this penalty is more than outweighed by the rise in the value of
R2.
11. R² may be defined in various ways, but the most common is

R² = ESS / TSS

Since both ESS and TSS will have units of the square of the dependent
variable, the units will cancel out and hence R² will be unit-free!
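A small sketch of this point: rescaling y (changing its units) rescales ESS and TSS by the same squared factor, leaving R² unchanged (data simulated purely for illustration):

```python
import numpy as np

# Rescaling y rescales ESS and TSS by the same squared factor, so
# R^2 is unchanged; the data here are simulated, purely illustrative.
def r_squared(y, y_fit):
    ess = np.sum((y_fit - y.mean()) ** 2)   # explained sum of squares
    tss = np.sum((y - y.mean()) ** 2)       # total sum of squares
    return ess / tss

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 2.0 + 1.5 * x + rng.normal(size=50)
coefs = np.polyfit(x, y, 1)
y_fit = np.polyval(coefs, x)

# Measure y in different units (e.g. pence instead of pounds):
assert np.isclose(r_squared(y, y_fit), r_squared(100 * y, 100 * y_fit))
print("R^2 is unchanged when y is rescaled: it is unit-free")
```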
The problem appears to be that the regression parameters are all individually
insignificant (i.e. not significantly different from zero), although the value of
R2 and its adjusted version are both very high, so that the regression taken as a
whole seems to indicate a good fit. This looks like a classic example of what we
term near multicollinearity. This is where the individual regressors are very
closely related, so that it becomes difficult to disentangle the effect of each
individual variable upon the dependent variable.
The solution to near multicollinearity that is usually suggested is that since the
problem is really one of insufficient information in the sample to determine
each of the coefficients, then one should go out and get more data. In other
words, we should switch to a higher frequency of data for analysis (e.g. weekly
instead of monthly, monthly instead of quarterly etc.). An alternative is also to
get more data by using a longer sample period (i.e. one going further back in
time), or to combine the two independent variables in a ratio (e.g. x2t / x3t ).
Other, more ad hoc methods for dealing with the possible existence of near
multicollinearity were discussed in Chapter 4:
• Ignore it: if the model is otherwise adequate, i.e. statistically and in terms
of each coefficient being of a plausible magnitude and having an
appropriate sign. Sometimes, the existence of multicollinearity does not
reduce the tratios on variables that would have been significant without
the multicollinearity sufficiently to make them insignificant. It is worth
stating that the presence of near multicollinearity does not affect the BLUE
properties of the OLS estimator i.e. it will still be consistent, unbiased
and efficient since the presence of near multicollinearity does not violate
any of the CLRM assumptions 1–4. However, in the presence of near
multicollinearity, it will be hard to obtain small standard errors. This will
not matter if the aim of the modelbuilding exercise is to produce forecasts
from the estimated model, since the forecasts will be unaffected by the
presence of near multicollinearity so long as this relationship between the
explanatory variables continues to hold over the forecasted sample.
• Transform the highly correlated variables into a ratio and include only the
ratio and not the individual variables in the regression. Again, this may be
unacceptable if financial theory suggests that changes in the dependent
variable should occur following changes in the individual explanatory
variables, and not a ratio of them.
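The effect on standard errors can be illustrated with a small simulation (all parameter values and the setup are assumptions, not from the text):

```python
import numpy as np

# Illustrative simulation (assumed setup): a near-duplicate regressor
# inflates the OLS standard errors on the slopes, although OLS itself
# remains unbiased and no CLRM assumption is violated.
rng = np.random.default_rng(2)
T = 100
x2 = rng.normal(size=T)
x3_near = x2 + 0.01 * rng.normal(size=T)   # near multicollinearity
x3_ind = rng.normal(size=T)                # independent alternative

def slope_se(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])
    return np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))[1]  # SE on x2

e = rng.normal(size=T)
y_near = 1.0 + 0.5 * x2 + 0.5 * x3_near + e
y_ind = 1.0 + 0.5 * x2 + 0.5 * x3_ind + e

se_near = slope_se(np.column_stack([np.ones(T), x2, x3_near]), y_near)
se_ind = slope_se(np.column_stack([np.ones(T), x2, x3_ind]), y_ind)
print(round(float(se_near), 2), round(float(se_ind), 2))
assert se_near > 10 * se_ind   # the standard error on x2's slope explodes
```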
The Durbin–Watson statistic is calculated as

DW = Σ_{t=2}^{T} (ût − ût−1)² / Σ_{t=2}^{T} ût²
You would then need to look up the two critical values from the Durbin–
Watson tables, and these would depend on how many observations and how
many regressors (excluding the constant this time) you had in the model.
The rejection / nonrejection rule would be given by selecting the appropriate
region from the following diagram:
[Figure: Durbin–Watson rejection and non-rejection regions]
The major steps involved in calculating the long-run solution are to:
• set the disturbance term equal to its expected value of zero
• drop the time subscripts
• remove all difference terms altogether, since these will all be zero by the
definition of the long run in this context.
Following these steps, we obtain

0 = β1 + β4 y + β5 x2 + β6 x3 + β7 x3

We now want to rearrange this to have all the terms in x3 together and so that
y is the subject of the formula:

−β4 y = β1 + β5 x2 + (β6 + β7) x3

y = −(β1/β4) − (β5/β4) x2 − ((β6 + β7)/β4) x3

The last equation above is the long-run solution.
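The arithmetic of the long-run solution can be sketched with hypothetical coefficient values (these are assumptions, not the question's actual estimates):

```python
# Sketch of the long-run calculation; b1, b4, b5, b6, b7 are hypothetical
# values, not the question's estimates.
b1, b4, b5, b6, b7 = 0.1, -0.8, 0.4, 0.2, 0.1

def long_run_y(x2, x3):
    # From 0 = b1 + b4*y + b5*x2 + (b6 + b7)*x3, make y the subject:
    return -(b1 + b5 * x2 + (b6 + b7) * x3) / b4

print(long_run_y(x2=1.0, x3=2.0))  # -(0.1 + 0.4 + 0.6)/(-0.8) = 1.375
```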
1981M1–1995M12:  rt = 0.0215 + 1.491 rmt ;  RSS = 0.189, T = 180
1981M1–1987M10:  rt = 0.0163 + 1.308 rmt ;  RSS = 0.079, T = 82
1987M11–1995M12: rt = 0.0360 + 1.613 rmt ;  RSS = 0.082, T = 98
(c) If we define the coefficient estimates for the first and second halves of the
sample as α1 and β1, and α2 and β2 respectively, then the null and alternative
hypotheses are

H0: α1 = α2 and β1 = β2

and

H1: α1 ≠ α2 or β1 ≠ β2
test stat = [(RSS − (RSS1 + RSS2))/k] × [(T − 2k)/(RSS1 + RSS2)]
          = [(0.189 − (0.079 + 0.082))/2] × [(180 − 4)/(0.079 + 0.082)]
          = 15.304
This follows an F distribution with (k, T−2k) degrees of freedom. F(2, 176) =
3.05 at the 5% level. Clearly we reject the null hypothesis that the coefficients
are equal in the two subperiods.
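The Chow test above can be reproduced as follows, using the quoted RSS values (scipy supplies the critical value):

```python
from scipy import stats

# Sketch of the Chow test using the RSS values quoted above.
rss_whole, rss_1, rss_2 = 0.189, 0.079, 0.082
T, k = 180, 2   # k = parameters per sub-sample regression

chow = ((rss_whole - (rss_1 + rss_2)) / k) * ((T - 2 * k) / (rss_1 + rss_2))
f_crit = stats.f.ppf(0.95, k, T - 2 * k)   # F(2, 176) at the 5% level

print(round(chow, 3), round(float(f_crit), 2))   # about 15.304 and 3.05
assert chow > f_crit   # reject coefficient stability across sub-samples
```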
10. The data we have are:

1981M1–1995M12: rt = 0.0215 + 1.491 rmt ;  RSS = 0.189, T = 180
1981M1–1994M12: rt = 0.0212 + 1.478 rmt ;  RSS = 0.148, T = 168
1982M1–1995M12: RSS = 0.182, T = 168
First, the forward predictive failure test, i.e. we are trying to see if the model
for 1981M1–1994M12 can predict 1995M1–1995M12.
The test statistic is given by

test stat = [(RSS − RSS1)/RSS1] × [(T1 − k)/T2]
          = [(0.189 − 0.148)/0.148] × [(168 − 2)/12] = 3.832
where T1 is the number of observations in the first period (i.e. the period that
we actually estimate the model over), and T2 is the number of observations we
are trying to predict. The test statistic follows an F-distribution with
(T2, T1−k) degrees of freedom. F(12, 166) = 1.81 at the 5% level. So we reject the null
hypothesis that the model can predict the observations for 1995. We would
conclude that our model is no use for predicting this period, and from a
practical point of view, we would have to consider whether this failure is a
result of atypical behaviour of the series out-of-sample (i.e. during 1995), or
whether it results from a genuine deficiency in the model.
The backward predictive failure test is a little more difficult to understand,
although no more difficult to implement. The test statistic is given by

test stat = [(RSS − RSS1)/RSS1] × [(T1 − k)/T2]
          = [(0.189 − 0.182)/0.182] × [(168 − 2)/12] = 0.532
Now we need to be a little careful in our interpretation of what exactly are the
first and second sample periods. It would be possible to define T1 as always
being the first sample period. But I think it easier to say that T1 is always the
sample over which we estimate the model (even though it now comes after the
hold-out sample). Thus T2 is still the sample that we are trying to predict, even
though it comes first. You can use either notation, but you need to be clear and
consistent. If you wanted to choose the other way to the one I suggest, then
you would need to change the subscript 1 everywhere in the formula above so
that it was 2, and change every 2 so that it was a 1.
Either way, we conclude that there is little evidence against the null
hypothesis. Thus our model is able to adequately backcast the first 12
observations of the sample.
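Both predictive failure tests can be sketched in the same way, again using the quoted RSS values:

```python
from scipy import stats

# Sketch of the forward and backward predictive failure tests, using the
# RSS values quoted in the text.
def predictive_failure(rss_whole, rss_est, T1, T2, k):
    stat = ((rss_whole - rss_est) / rss_est) * ((T1 - k) / T2)
    return stat, stat > stats.f.ppf(0.95, T2, T1 - k)

fwd_stat, fwd_reject = predictive_failure(0.189, 0.148, T1=168, T2=12, k=2)
bwd_stat, bwd_reject = predictive_failure(0.189, 0.182, T1=168, T2=12, k=2)

print(round(fwd_stat, 3), fwd_reject)   # about 3.832, True  (reject)
print(round(bwd_stat, 3), bwd_reject)   # about 0.532, False (do not reject)
```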
11. By definition, variables having associated parameters that are not
significantly different from zero are not, from a statistical perspective, helping
to explain variations in the dependent variable about its mean value. One
could therefore argue that empirically, they serve no purpose in the fitted
regression model. But leaving such variables in the model will use up valuable
degrees of freedom, implying that the standard errors on all of the other
parameters in the regression model will be unnecessarily higher as a result. If
the number of degrees of freedom is relatively small, then saving a couple by
deleting two variables with insignificant parameters could be useful. On the
other hand, if the number of degrees of freedom is already very large, the
gain from deleting variables with insignificant parameters will be marginal.
yt = yt−1 + ut         (1)
yt = 0.5 yt−1 + ut     (2)
yt = 0.8 ut−1 + ut     (3)
(a) The first two models are roughly speaking AR(1) models, while the last is
an MA(1). Strictly, since the first model is a random walk, it should be called
an ARIMA(0,1,0) model, but it could still be viewed as a special case of an
autoregressive model.
(b) We know that the theoretical acf of an MA(q) process will be zero after q
lags, so the acf of the MA(1) will be zero at all lags after one. For an
autoregressive process, the acf dies away gradually. It will die away fairly
quickly for case (2), with each successive autocorrelation coefficient taking on
a value equal to half that of the previous lag. For the first case, however, the
acf will never die away, and in theory will always take on a value of one,
whatever the lag.
Turning now to the pacf, the pacf for the first two models would have a large
positive spike at lag 1, and no statistically significant pacfs at other lags.
Again, the unit root process of (1) would have a pacf the same as that of a
stationary AR process. The pacf for (3), the MA(1), will decline geometrically.
(c) Clearly the first equation (the random walk) is more likely to represent
stock prices in practice. The discounted dividend model of share prices states
that the current value of a share will be simply the discounted sum of all
expected future dividends. If we assume that investors form their expectations
about dividend payments rationally, then the current share price should
embody all information that is known about the future of dividend payments,
and hence today's price should only differ from yesterday's by the amount of
unexpected news which influences dividend payments.
Thus stock prices should follow a random walk. Note that we could apply a
similar rational expectations and random walk model to many other kinds of
financial series.
If the stock market really followed the process described by equations (2) or
(3), then we could potentially make useful forecasts of the series using our
model. In the latter case of the MA(1), we could only make one-step ahead
forecasts since the memory of the model is only that length. In the case of
equation (2), we could potentially make a lot of money by forming multiple-
step ahead forecasts and trading on the basis of these.
Hence after a period, it is likely that other investors would spot this potential
opportunity and hence the model would no longer be a useful description of
the data.
(d) See the book for the algebra. This part of the question is really an extension
of the others. Analysing the simplest case first, the MA(1), the memory of
the process will only be one period, and therefore a given shock or
innovation, ut, will only persist in the series (i.e. be reflected in yt) for one
period. After that, the effect of a given shock would have completely worked
through.
For the case of the AR(1) given in equation (2), a given shock, ut, will persist
indefinitely and will therefore influence the properties of yt for ever, but its
effect upon yt will diminish exponentially as time goes on.
In the first case, the series yt could be written as an infinite sum of past
shocks, and therefore the effect of a given shock will persist indefinitely, and
its effect will not diminish over time.
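These properties can be illustrated by simulating the three processes and computing sample autocorrelations (the seed and sample size are arbitrary choices, not from the text):

```python
import numpy as np

# Illustrative simulation of processes (1)-(3); seed and sample size are
# arbitrary choices.
rng = np.random.default_rng(3)
T = 5000
u = rng.normal(size=T)

rw = np.cumsum(u)                  # (1) y_t = y_{t-1} + u_t (random walk)
ar = np.zeros(T)                   # (2) y_t = 0.5 y_{t-1} + u_t
for t in range(1, T):
    ar[t] = 0.5 * ar[t - 1] + u[t]
ma = u.copy()                      # (3) y_t = 0.8 u_{t-1} + u_t
ma[1:] += 0.8 * u[:-1]

def acf(y, lag):
    y = y - y.mean()
    return float(y[lag:] @ y[:-lag] / (y @ y))

# Theory: AR(1) acf at lag k is 0.5**k; MA(1) acf is 0.8/(1 + 0.8**2) = 0.49
# at lag 1 and zero beyond; the random walk's acf stays near one.
print(round(acf(ar, 1), 2), round(acf(ma, 1), 2), round(acf(ma, 2), 2))
print(round(acf(rw, 1), 3))
```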
4. (a) Box and Jenkins were the first to consider ARMA modelling in this
logical and coherent fashion. Their methodology consists of 3 steps:
Identification: determining the appropriate order of the model using
graphical procedures (e.g. plots of autocorrelation functions).
Estimation: of the parameters of the model of size given in the first stage.
This can be done using least squares or maximum likelihood, depending on
the model.
Diagnostic checking: this step is to ensure that the model actually estimated is
adequate. B & J suggest two methods for achieving this:
• Overfitting, which involves deliberately fitting a model larger than
that suggested in step 1 and testing the hypothesis that all the additional
coefficients can jointly be set to zero.
yt = 0.803 yt−1 + 0.682 yt−2 + ut

Rewrite this as

(1 − 0.803L − 0.682L²) yt = ut

We want to find the roots of the lag polynomial 1 − 0.803z − 0.682z² = 0 and
determine whether they are greater than one in absolute value. It is easier (in
my opinion) to rewrite this formula (by multiplying through by −1/0.682,
using z for the characteristic equation, and rearranging) as

z² + 1.177z − 1.466 = 0
Using the standard formula for obtaining the roots of a quadratic equation,

z = [−1.177 ± √(1.177² + 4 × 1 × 1.466)] / 2 = 0.758 or −1.934
Since ALL the roots must be greater than one in absolute value for the model
to be stationary, we conclude that the estimated model is not stationary in this
case.
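The root calculation can be checked numerically, e.g. with numpy:

```python
import numpy as np

# Roots of the characteristic equation 1 - 0.803z - 0.682z^2 = 0 for the
# model y_t = 0.803 y_{t-1} + 0.682 y_{t-2} + u_t.
roots = np.roots([-0.682, -0.803, 1.0])   # coefficients, highest power first
print(np.round(roots, 3))                  # roughly 0.758 and -1.935

# Stationarity requires every root to lie outside the unit circle:
print("stationary:", all(abs(r) > 1 for r in roots))   # False: one is inside
```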
6. Using the formulae above, we end up with the following values for each
criterion and for each model order (with an asterisk denoting the smallest
value of the information criterion in each case).
ARMA(p,q)   log(σ̂²)   AIC      SBIC
(0,0)       0.932      0.942    0.944
(1,0)       0.864      0.884    0.887
(0,1)       0.902      0.922    0.925
(1,1)       0.836      0.866    0.870
(2,1)       0.801      0.841    0.847
(1,2)       0.821      0.861    0.867
(2,2)       0.789      0.839    0.846
(3,2)       0.773      0.833*   0.842*
(2,3)       0.782      0.842    0.851
(3,3)       0.764      0.834    0.844
The result is pretty clear: both SBIC and AIC say that the appropriate model is
an ARMA(3,2).
7. We could still perform the Ljung–Box test on the residuals of the estimated
models to see if there was any linear dependence left unaccounted for by our
postulated models.
Another test of the model's adequacy that we could use is to leave out some of
the observations at the identification and estimation stage, and attempt to
construct out-of-sample forecasts for these. For example, if we have 2000
observations, we may use only 1800 of them to identify and estimate the
models, and leave the remaining 200 for construction of forecasts. We would
then prefer the model that gave the most accurate forecasts.
8. This is not true in general. Yes, we do want to form a model which fits the
data as well as possible. But in most financial series, there is a substantial
amount of noise. This can be interpreted as a number of random events that
are unlikely to be repeated in any forecastable way. We want to fit a model to
the data which will be able to generalise. In other words, we want a model
which fits to features of the data which will be replicated in future; we do not
want to fit to samplespecific noise.
This is why we need the concept of parsimony: fitting the smallest possible
model to the data. Otherwise we may get a great fit to the data in sample, but
any use of the model for forecasts could yield terrible results.
Another important point is that the larger the number of estimated
parameters (i.e. the more variables we have), then the smaller will be the
number of degrees of freedom, and this will imply that coefficient standard
errors will be larger than they would otherwise have been. This could lead to a
loss of power in hypothesis tests, and variables that would otherwise have
been significant may now appear insignificant.
9. (a) We class an autocorrelation coefficient or partial autocorrelation
coefficient as significant if it exceeds ±1.96 × 1/√T = ±1.96/√100 = ±0.196.
Under this rule, the sample autocorrelation functions (sacfs) at lags 1 and 4
are significant, and the spacfs at lags 1, 2, 3, 4 and 5 are all significant.
This clearly looks like the data are consistent with a first order moving average
process, since only the first acf is significant (the significant lag 4 acf is
a typical wrinkle that one might expect with real data and should probably be
ignored), and the pacf has a slowly declining structure.
(b) The formula for the Ljung–Box Q* test is given by

Q* = T(T+2) Σ_{k=1}^{m} τ̂k² / (T − k)
   = 100 × 102 × [τ̂1²/(100−1) + τ̂2²/(100−2) + τ̂3²/(100−3)] = 19.41

for m = 3.
The 5% and 1% critical values for a χ² distribution with 3 degrees of freedom
are 7.81 and 11.3 respectively. Clearly, then, we would reject the null
hypothesis that the first three autocorrelation coefficients are jointly not
significantly different from zero.
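A sketch of the Q* computation; the autocorrelation coefficients used below are placeholders, since the question's actual values are not reproduced in this text:

```python
import numpy as np

# Ljung-Box Q* statistic; the autocorrelation coefficients below are
# placeholders (assumed), not the question's actual values.
def ljung_box_q_star(taus, T):
    taus = np.asarray(taus, dtype=float)
    k = np.arange(1, len(taus) + 1)
    return float(T * (T + 2) * np.sum(taus ** 2 / (T - k)))

q_star = ljung_box_q_star([0.42, 0.10, 0.05], T=100)
chi2_crit_5pct = 7.81   # chi-squared critical value, 3 df (from tables)
print(round(q_star, 2), q_star > chi2_crit_5pct)
```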
10. (a) To solve this, we need the concept of a conditional expectation, i.e.
E(yt | yt−1, yt−2, ...). For example, in the context of an AR(1) model such as
yt = a0 + a1 yt−1 + ut, if we are now at time t−1 (and dropping the t−1
subscript on the expectations operator),

E(yt) = a0 + a1 yt−1
E(yt+1) = a0 + a1 E(yt)
        = a0 + a1 (a0 + a1 yt−1)
        = a0 + a0a1 + a1² yt−1
E(yt+2) = a0 + a1 E(yt+1)
        = a0 + a1 (a0 + a1 E(yt))
        = a0 + a0a1 + a1² E(yt)
        = a0 + a0a1 + a1² (a0 + a1 yt−1)
        = a0 + a0a1 + a0a1² + a1³ yt−1
etc.
The forecasts are therefore built up recursively:

f_{t−1,1} = a0 + a1 yt−1
f_{t−1,2} = a0 + a1 f_{t−1,1}
f_{t−1,3} = a0 + a1 f_{t−1,2}

and so on, where f_{t−1,1} = E(yt | yt−1, yt−2, ...).
For an MA(1) model, yt = ut + b1 ut−1, the forecasts are

f_{t−1,1} = E(yt | yt−1, yt−2, ...) = E(ut + b1 ut−1) = b1 ut−1

but

f_{t−1,2} = E(yt+1 | yt−1, yt−2, ...) = E(ut+1 + b1 ut) = 0
f_{t−1,3} = 0

etc., since the conditional expectation of every future disturbance is zero.
(b) Given the forecasts and the actual values, it is very easy to calculate the
MSE by plugging the numbers into the relevant formula, which in this case is

MSE = (1/N) Σ_{n=1}^{N} (x_{t−1+n} − f_{t−1,n})²
Notice also that 84% of the total MSE is coming from the error in the first
forecast. Thus error measures can be driven by one or two times when the
model fits very badly. For example, if the forecast period includes a stock
market crash, this can lead the mean squared error to be 100 times bigger
than it would have been if the crash observations were not included. This point
needs to be considered whenever forecasting models are evaluated. An idea of
whether this is a problem in a given situation can be gained by plotting the
forecast errors over time.
(c) This question is much simpler to answer than it looks! In fact, the inclusion
of the smoothing coefficient is a red herring, i.e. a piece of misleading and
useless information. The correct approach is to say that if we believe that the
exponential smoothing model is appropriate, then all useful information will
have already been used in the calculation of the current smoothed value
(which will of course have used the smoothing coefficient in its calculation).
Thus the three forecasts are all 0.0305.
(d) The solution is to work out the mean squared error for the exponential
smoothing model. The calculation is

MSE = (1/3) [(0.0305 − (−0.032))² + (0.0305 − 0.961)² + (0.0305 − 0.203)²]
    = (1/3) (0.0039 + 0.8658 + 0.0298) = 0.2998
Therefore, we conclude that since the mean squared error is smaller for the
exponential smoothing model than for the Box–Jenkins model, the former
produces the more accurate forecasts. We should, however, bear in mind that
the question of accuracy was determined using only 3 forecasts, which would
be insufficient in a real application.
11. (a) The shapes of the acf and pacf are perhaps best summarised in a table:

Process      acf                                          pacf
White noise  No significant coefficients                  No significant coefficients
AR(2)        Geometrically declining or damped sinusoid   First 2 coefficients significant, others zero
MA(1)        First coefficient significant, others zero   Geometrically declining or damped sinusoid
ARMA(2,1)    Geometrically declining or damped sinusoid   Geometrically declining or damped sinusoid
A couple of further points are worth noting. First, it is not possible to tell what
the signs of the coefficients for the acf or pacf would be for the last three
processes, since that would depend on the signs of the coefficients of the
processes. Second, for mixed processes, the AR part dominates from the point
of view of acf calculation, while the MA part dominates for pacf calculation.
(b) The important point here is to focus on the MA part of the model and to
ignore the AR dynamics. The characteristic equation would be

1 + 0.42z = 0

The root of this equation is z = −1/0.42 = −2.38, which lies outside the unit
circle, and therefore the MA part of the model is invertible.
(c) Since no values for the series y or the lagged residuals are given, the
answers should be stated in terms of y and of u. Assuming that information is
available up to and including time t, the 1step ahead forecast would be for
time t+1, the 2step ahead for time t+2 and so on. A useful first step would be
to write the model out for y at times t+1, t+2, t+3, t+4:

yt+1 = 0.036 + 0.69 yt + 0.42 ut + ut+1
yt+2 = 0.036 + 0.69 yt+1 + 0.42 ut+1 + ut+2
yt+3 = 0.036 + 0.69 yt+2 + 0.42 ut+2 + ut+3
yt+4 = 0.036 + 0.69 yt+3 + 0.42 ut+3 + ut+4
The 1-step ahead forecast would simply be the conditional expectation of y for
time t+1 made at time t. Denoting the 1-step ahead forecast made at time t as
ft,1, the 2-step ahead forecast made at time t as ft,2 and so on:

ft,1 = Et[yt+1] = Et[0.036 + 0.69 yt + 0.42 ut + ut+1] = 0.036 + 0.69 yt + 0.42 ut
ft,2 = Et[yt+2] = Et[0.036 + 0.69 yt+1 + 0.42 ut+1 + ut+2] = 0.036 + 0.69 ft,1

since Et[ut+1] = 0 and Et[ut+2] = 0. Thus, beyond 1-step ahead, the MA(1) part of
the model disappears from the forecast and only the autoregressive part
remains. Although we do not know yt+1, its expected value is the 1-step ahead
forecast that was made at the first stage, ft,1.
The 3-step ahead forecast would be given by

ft,3 = Et[yt+3] = Et[0.036 + 0.69 yt+2 + 0.42 ut+2 + ut+3] = 0.036 + 0.69 ft,2
The Box–Pierce and Ljung–Box statistics are given by

Q = T Σ_{k=1}^{m} τ̂k²

and

Q* = T(T+2) Σ_{k=1}^{m} τ̂k² / (T − k)
   = 500 × 502 × [τ̂1²/(500−1) + τ̂2²/(500−2) + τ̂3²/(500−3) + τ̂4²/(500−4) + τ̂5²/(500−5)]
   = 71.39
The test statistics will both follow a χ² distribution with 5 degrees of freedom
(the number of autocorrelation coefficients being used in the test). The critical
values are 11.07 and 15.09 at 5% and 1% respectively. Clearly, the null
hypothesis that the first 5 autocorrelation coefficients are jointly zero is
resoundingly rejected.
(c) Setting aside the lag 5 autocorrelation coefficient, the pattern in the table is
for the autocorrelation coefficient to only be significant at lag 1 and then to fall
rapidly to values close to zero, while the partial autocorrelation coefficients
appear to fall much more slowly as the lag length increases. These
characteristics would lead us to think that an appropriate model for this series
is an MA(1). Of course, the autocorrelation coefficient at lag 5 is an anomaly
that does not fit in with the pattern of the rest of the coefficients. But such a
result would be typical of a real data series (as opposed to a simulated data
series that would have a much cleaner structure). This serves to illustrate that
when econometrics is used for the analysis of real data, the data generating
process was almost certainly not any of the models in the ARMA family. So all
we are trying to do is to find a model that best describes the features of the
data to hand. As one econometrician put it, "all models are wrong, but some
are useful"!
(d) Forecasts from this ARMA model would be produced in the usual way.
Using the same notation as above, and letting fz,1 denote the forecast for time
z+1 made for x at time z, etc:
Model A: MA(1)

fz,1 = 0.38 + 0.10 uz = 0.38 + 0.10 × (−0.02) = 0.378
fz,2 = fz,3 = 0.38
Note that the MA(1) model only has a memory of one period, so all forecasts
further than one step ahead will be equal to the intercept.
Model B: AR(2)

xt = 0.63 + 0.17 xt−1 − 0.09 xt−2 + ut
one extra MA term and one extra AR term. Thus it would be sensible to try an
ARMA(1,2) in the context of Model A, and an ARMA(3,1) in the context of
Model B. Residual diagnostics would involve examining the acf and pacf of the
residuals from the estimated model. If the residuals showed any action, that
is, if any of the acf or pacf coefficients showed statistical significance, this
would suggest that the original model was inadequate. Residual diagnostics
in the BoxJenkins sense of the term involved only examining the acf and pacf,
rather than the array of diagnostics considered in Chapter 4.
It is worth noting that these two model evaluation procedures would only
indicate a model that was too small. If the model were too large, i.e. it had
superfluous terms, these procedures would deem the model adequate.
(f) There are obviously several forecast accuracy measures that could be
employed, including MSE, MAE, and the percentage of correct sign
predictions. Assuming that MSE is used, the MSE for each model is
MSE(Model A) = (1/4)[(0.378 - 0.62)2 + (0.38 - 0.19)2 + (0.38 - 0.32)2 + (0.38 - 0.72)2] = 0.175
MSE(Model B) = (1/4)[(0.681 - 0.62)2 + (0.718 - 0.19)2 + (0.690 - 0.32)2 + (0.683 - 0.72)2] = 0.326
Therefore, since the mean squared error for Model A is smaller, it would be
concluded that the moving average model is the more accurate of the two in
this case.
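The MSE arithmetic can be sketched as follows, using the forecast and actual values from the calculation above (the point of the sketch is the formula and the ranking of the two models, not the exact rounded magnitudes):

```python
# Mean squared error of each model's forecasts against the actual values
actuals    = [0.62, 0.19, 0.32, 0.72]
forecast_A = [0.378, 0.38, 0.38, 0.38]     # MA(1) forecasts
forecast_B = [0.681, 0.718, 0.690, 0.683]  # AR(2) forecasts

def mse(forecasts, actual):
    return sum((f - a) ** 2 for f, a in zip(forecasts, actual)) / len(actual)

mse_A = mse(forecast_A, actuals)
mse_B = mse(forecast_B, actuals)
assert mse_A < mse_B   # Model A, the MA(1), is the more accurate here
```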
y1t = α0 + α1y2t + α2y3t + α3X1t + α4X2t + u1t (1)
y2t = β0 + β1y3t + β2X1t + β3X3t + u2t (2)
y3t = γ0 + γ1y1t + γ2X2t + γ3X3t + u3t (3)
The easiest place to start (I think) is to take equation (1), and substitute in for y2t from (2) and for y3t from (3), to get

y1t = α0 + α1(β0 + β1y3t + β2X1t + β3X3t + u2t) + α2(γ0 + γ1y1t + γ2X2t + γ3X3t + u3t) + α3X1t + α4X2t + u1t (5)

Substituting (3) once more for the remaining y3t term, and taking the y1t terms to the LHS:

y1t(1 - α2γ1 - α1β1γ1) = α0 + α1β0 + α1β1γ0 + α2γ0 + (α1β2 + α3)X1t + (α1β1γ2 + α2γ2 + α4)X2t + (α1β1γ3 + α1β3 + α2γ3)X3t + (α1β1 + α2)u3t + α1u2t + u1t (6)
Multiplying all through equation (3) by (1 - α2γ1 - α1β1γ1):

y3t(1 - α2γ1 - α1β1γ1) = γ0(1 - α2γ1 - α1β1γ1) + γ1y1t(1 - α2γ1 - α1β1γ1) + γ2X2t(1 - α2γ1 - α1β1γ1) + γ3X3t(1 - α2γ1 - α1β1γ1) + u3t(1 - α2γ1 - α1β1γ1) (7)

Replacing y1t(1 - α2γ1 - α1β1γ1) in (7) with the RHS of (6):

y3t(1 - α2γ1 - α1β1γ1) = γ0(1 - α2γ1 - α1β1γ1) + γ1[α0 + α1β0 + α1β1γ0 + α2γ0 + (α1β2 + α3)X1t + (α1β1γ2 + α2γ2 + α4)X2t + (α1β1γ3 + α1β3 + α2γ3)X3t + (α1β1 + α2)u3t + α1u2t + u1t] + γ2X2t(1 - α2γ1 - α1β1γ1) + γ3X3t(1 - α2γ1 - α1β1γ1) + u3t(1 - α2γ1 - α1β1γ1) (8)

Expanding the brackets in equation (8) and cancelling the relevant terms:

y3t(1 - α2γ1 - α1β1γ1) = γ0 + γ1α0 + γ1α1β0 + γ1(α1β2 + α3)X1t + (γ2 + γ1α4)X2t + (γ3 + γ1α1β3)X3t + u3t + γ1α1u2t + γ1u1t (9)
Multiplying all through equation (2) by (1 - α2γ1 - α1β1γ1):

y2t(1 - α2γ1 - α1β1γ1) = β0(1 - α2γ1 - α1β1γ1) + β1y3t(1 - α2γ1 - α1β1γ1) + β2X1t(1 - α2γ1 - α1β1γ1) + β3X3t(1 - α2γ1 - α1β1γ1) + u2t(1 - α2γ1 - α1β1γ1) (10)

Replacing y3t(1 - α2γ1 - α1β1γ1) in (10) with the RHS of (9):

y2t(1 - α2γ1 - α1β1γ1) = β0(1 - α2γ1 - α1β1γ1) + β1[γ0 + γ1α0 + γ1α1β0 + γ1(α1β2 + α3)X1t + (γ2 + γ1α4)X2t + (γ3 + γ1α1β3)X3t + u3t + γ1α1u2t + γ1u1t] + β2X1t(1 - α2γ1 - α1β1γ1) + β3X3t(1 - α2γ1 - α1β1γ1) + u2t(1 - α2γ1 - α1β1γ1) (11)

Expanding the brackets in (11) and cancelling the relevant terms:

y2t(1 - α2γ1 - α1β1γ1) = β0 - β0α2γ1 + β1γ0 + β1γ1α0 + (β1γ1α3 + β2 - β2α2γ1)X1t + (β1γ2 + β1γ1α4)X2t + (β1γ3 + β3 - β3α2γ1)X3t + β1u3t + (1 - α2γ1)u2t + β1γ1u1t (12)
Although it might not look like it (!), equations (6), (12), and (9) respectively
will give the reduced form equations corresponding to (1), (2), and (3), by
doing the necessary division to make y1t, y2t, or y3t the subject of the formula.
From (6),

y1t = [α0 + α1β0 + α1β1γ0 + α2γ0 + (α1β2 + α3)X1t + (α1β1γ2 + α2γ2 + α4)X2t + (α1β1γ3 + α1β3 + α2γ3)X3t + (α1β1 + α2)u3t + α1u2t + u1t] / (1 - α2γ1 - α1β1γ1) (13)
From (12),

y2t = [β0 - β0α2γ1 + β1γ0 + β1γ1α0 + (β1γ1α3 + β2 - β2α2γ1)X1t + (β1γ2 + β1γ1α4)X2t + (β1γ3 + β3 - β3α2γ1)X3t + β1u3t + (1 - α2γ1)u2t + β1γ1u1t] / (1 - α2γ1 - α1β1γ1) (14)
From (9),

y3t = [γ0 + γ1α0 + γ1α1β0 + γ1(α1β2 + α3)X1t + (γ2 + γ1α4)X2t + (γ3 + γ1α1β3)X3t + u3t + γ1α1u2t + γ1u1t] / (1 - α2γ1 - α1β1γ1) (15)
Notice that all of the reduced form equations (13)-(15) in this case depend on
all of the exogenous variables, which is not always the case, and that the
equations contain only exogenous variables on the RHS, which must be the
case for these to be reduced forms.
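The algebra above can be sanity-checked numerically. This sketch picks arbitrary parameter and data values (all invented purely for illustration), computes y1t from reduced form (13), recovers y3t and y2t from structural equations (3) and (2), and verifies that structural equation (1) then holds exactly:

```python
# Numerical check of reduced form (13) against the structural system (1)-(3)
a0, a1, a2, a3, a4 = 0.5, 0.3, 0.2, 0.4, 0.1   # the alphas (illustrative values)
b0, b1, b2, b3     = 0.2, 0.5, 0.3, 0.6        # the betas
g0, g1, g2, g3     = 0.1, 0.4, 0.2, 0.3        # the gammas
X1, X2, X3         = 1.0, 2.0, 3.0             # exogenous variables
u1, u2, u3         = 0.05, -0.02, 0.01         # disturbances

D = 1 - a2 * g1 - a1 * b1 * g1                 # the common denominator
y1 = (a0 + a1 * b0 + a1 * b1 * g0 + a2 * g0
      + (a1 * b2 + a3) * X1
      + (a1 * b1 * g2 + a2 * g2 + a4) * X2
      + (a1 * b1 * g3 + a1 * b3 + a2 * g3) * X3
      + (a1 * b1 + a2) * u3 + a1 * u2 + u1) / D    # reduced form (13)

y3 = g0 + g1 * y1 + g2 * X2 + g3 * X3 + u3     # structural equation (3)
y2 = b0 + b1 * y3 + b2 * X1 + b3 * X3 + u2     # structural equation (2)
check = a0 + a1 * y2 + a2 * y3 + a3 * X1 + a4 * X2 + u1
assert abs(y1 - check) < 1e-9                  # equation (1) is satisfied
```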
(b) The term identification refers to whether or not it is in fact possible to
obtain the structural form coefficients (the αs, βs, and γs in equations (1)-(3))
from the reduced form coefficients (the πs) by substitution. An equation can
be over-identified, just-identified, or under-identified, and the equations in a
system can have differing orders of identification. If an equation is
under-identified (or not identified), then we cannot obtain the structural form
coefficients from the reduced forms using any technique. If it is just identified,
we can obtain unique structural form estimates by back-substitution, while if
it is over-identified, we cannot obtain unique structural form estimates by
substituting from the reduced forms.
There are two rules for determining the degree of identification of an
equation: the rank condition, and the order condition. The rank condition is a
necessary and sufficient condition for identification, so if the rule is satisfied,
it guarantees that the equation is indeed identified. The rule centres around a
restriction on the rank of a submatrix containing the reduced form
coefficients, and is rather complex and not particularly illuminating, and was
therefore not covered in this course.
The order condition can be expressed in a number of ways, one of which is the
following. Let G denote the number of structural equations (equal to the
number of endogenous variables). An equation is just identified if G - 1
variables are absent. If more than G - 1 are absent, then the equation is
over-identified, while if fewer are absent, then it is not identified.
Applying this rule to equations (1)(3), G=3, so for an equation to be
identified, we require 2 to be absent. The variables in the system are y1, y2, y3,
X1, X2, X3. Is this the case?
Equation (1): X3t only is missing, so the equation is not identified.
Equation (2): y1t and X2t are missing, so the equation is just identified.
Equation (3): y2t and X1t are missing, so the equation is just identified.
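The counting exercise just performed can be sketched mechanically:

```python
# Order condition applied to equations (1)-(3): with G = 3 equations,
# G - 1 = 2 variables must be absent for just-identification.
all_vars = {"y1", "y2", "y3", "X1", "X2", "X3"}
included = {
    1: {"y1", "y2", "y3", "X1", "X2"},   # eq (1): only X3 is absent
    2: {"y2", "y3", "X1", "X3"},         # eq (2): y1 and X2 are absent
    3: {"y3", "y1", "X2", "X3"},         # eq (3): y2 and X1 are absent
}
G = 3
status = {}
for eq, vars_in in included.items():
    absent = len(all_vars - vars_in)
    if absent < G - 1:
        status[eq] = "not identified"
    elif absent == G - 1:
        status[eq] = "just identified"
    else:
        status[eq] = "over-identified"
print(status)
```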
However, the order condition is only a necessary (and not a sufficient)
condition for identification, so there will exist cases where a given equation
satisfies the order condition, but we still cannot obtain the structural form
coefficients. Fortunately, for small systems this is rarely the case. Also, in
practice, most systems are designed to contain equations that are over-identified.
(c). It was stated in Chapter 4 that omitting a relevant variable from a
regression equation would lead to an omitted variable bias (in fact an
inconsistency as well), while including an irrelevant variable would lead to
unbiased but inefficient coefficient estimates. There is a direct analogy with
the simultaneous variable case. Treating a variable as exogenous when it really
should be endogenous because there is some feedback, will result in biased
and inconsistent parameter estimates. On the other hand, treating a variable
as endogenous when it really should be exogenous (that is, having an equation
for the variable and then substituting the fitted value from the reduced form if
2SLS is used, rather than just using the actual value of the variable) would
result in unbiased but inefficient coefficient estimates.
If we take the view that consistency and unbiasedness are more important than
efficiency (which is the view that I think most econometricians would take),
this implies that treating an endogenous variable as exogenous represents the
more severe misspecification. So if in doubt, include an equation for it!
(Although, of course, we can test for exogeneity using a Hausman-type test.)
(d). A tempting response to the question might be to describe indirect least
squares (ILS), that is, estimating the reduced form equations by OLS and then
substituting back to get the structural forms; however, this response would be
WRONG, since the question tells us that the system is over-identified.
A correct answer would be to describe either two stage least squares (2SLS) or
instrumental variables (IV). Either would be acceptable, although IV requires
the user to determine an appropriate set of instruments and hence 2SLS is
simpler in practice. 2SLS involves estimating the reduced form equations, and
obtaining the fitted values in the first stage. In the second stage, the structural
form equations are estimated, but replacing the endogenous variables on the
RHS with their stage one fitted values. Application of this technique will yield
unique and consistent structural form coefficient estimates.
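The two stages can be sketched with a single endogenous regressor and a single exogenous variable acting as the instrument. All data-generating values below are invented for illustration; the point is that the stage-two slope recovers the structural coefficient while naive OLS does not:

```python
# Minimal 2SLS sketch: y2 is endogenous (correlated with u1), X1 is exogenous
import random

random.seed(42)

def ols(x, y):
    """Simple-regression OLS: returns (intercept, slope) of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
    return ybar - b * xbar, b

n = 5000
X1 = [random.gauss(0, 1) for _ in range(n)]
u1 = [random.gauss(0, 1) for _ in range(n)]
u2 = [random.gauss(0, 1) for _ in range(n)]
# Simultaneity: y2 depends on u1, so OLS of y1 on y2 is biased and inconsistent
y2 = [2.0 + 1.5 * x + 0.8 * e1 + e2 for x, e1, e2 in zip(X1, u1, u2)]
y1 = [1.0 + 0.5 * v + e1 for v, e1 in zip(y2, u1)]   # true slope on y2 is 0.5

# Stage 1: regress the endogenous y2 on the exogenous X1, keep fitted values
a1_, b1_ = ols(X1, y2)
y2_hat = [a1_ + b1_ * x for x in X1]

# Stage 2: regress y1 on the stage-one fitted values
_, beta_2sls = ols(y2_hat, y1)
_, beta_ols = ols(y2, y1)      # naive OLS, for comparison
print(round(beta_2sls, 3), round(beta_ols, 3))
```

With this design the 2SLS slope is close to the true value 0.5, while the OLS slope is pushed upwards by the correlation between y2 and u1.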
2. (a) A glance at equations (6.97) and (6.98) reveals that the dependent
variable in (6.97) appears as an explanatory variable in (6.98) and that the
dependent variable in (6.98) appears as an explanatory variable in (6.97). The
result is that it would be possible to show that the explanatory variable y2t in
(6.97) will be correlated with the error term in that equation, u1t, and that the
explanatory variable y1t in (6.98) will be correlated with the error term in that
equation, u2t. Thus, there is causality from y1t to y2t and from y2t to y1t, so that
this is a simultaneous equations system. If OLS were applied separately to
each of equations (6.97) and (6.98), the result would be biased and
inconsistent parameter estimates. That is, even with an infinitely large
number of observations, OLS could not be relied upon to deliver the
appropriate parameter estimates.
(b) If the variable y1t had not appeared on the RHS of equation (6.98), this
would no longer be a simultaneous system, but would instead be an example
of a triangular system (see question 3). Thus it would be valid to apply OLS
separately to each of the equations (6.97) and (6.98).
(c) The order condition for determining whether an equation from a
simultaneous system is identified was described in question 1, part (b). There
are 2 equations in the system of (6.97) and (6.98), so that only 1 variable
would have to be missing from an equation to make it just identified. If no
variables are absent, the equation would not be identified, while if more than
one were missing, the equation would be over-identified. Considering
equation (6.97), no variables are missing so that this equation is not
identified, while equation (6.98) excludes only variable X2t, so that it is just
identified.
(d) Since equation (6.97) is not identified, no method could be used to obtain
estimates of the parameters of this equation, while either ILS or 2SLS could be
used to obtain estimates of the parameters of (6.98), since it is just identified.
ILS operates by obtaining and estimating the reduced form equations and
then obtaining the structural parameters of (6.98) by algebraic back-substitution. 2SLS involves again obtaining and estimating the reduced form
equations, and then estimating the structural equations but replacing the
endogenous variables on the RHS of (6.97) and (6.98) with their reduced form
fitted values.
Comparing ILS and 2SLS, the former method only requires one set of
estimations rather than two, but this is about its only advantage, and
conducting a second stage OLS estimation is usually a computationally trivial
exercise. The primary disadvantage of ILS is that it is only applicable to just
identified equations, whereas many sets of equations that we may wish to
estimate are over-identified. Second, obtaining the structural form coefficients
via algebraic substitution can be a very tedious exercise in the context of large
systems (as the solution to question 1, part (a) shows!).
(e) The Hausman procedure works by first obtaining and estimating the
reduced form equations, and then estimating the structural form equations
separately using OLS, but also adding the fitted values from the reduced form
estimations as additional explanatory variables in the equations where those
variables appear as endogenous RHS variables. Thus, if the reduced form
fitted values corresponding to equations (6.97) and (6.98) are given by ŷ1t and
ŷ2t respectively, the Hausman test equations would be

y1t = α0 + α1y2t + α2X1t + α3X2t + α4ŷ2t + u1t′
y2t = β0 + β1y1t + β2X1t + β3ŷ1t + u2t′

Separate tests of the significance of the ŷ2t and ŷ1t terms would then be
performed. If it were concluded that they were both significant, this would
imply that additional explanatory power can be obtained by treating the
variables as endogenous.
3. An example of a triangular system was given in Section 6.7. Consider a
scenario where there are only two endogenous variables. The key distinction
between this and a fully simultaneous system is that in the case of a triangular
system, causality runs only in one direction, whereas for a simultaneous
equation, it would run in both directions. Thus, to give an example, for the
system to be triangular, y1 could appear in the equation for y2 and not vice
versa. For the simultaneous system, y1 would appear in the equation for y2,
and y2 would appear in the equation for y1.
4. (a) p=2 and k=3 implies that there are two variables in the system, and that
both equations have three lags of the two variables. The VAR can be written in
longhand form as:
y1t = α10 + α111y1t-1 + α211y2t-1 + α112y1t-2 + α212y2t-2 + α113y1t-3 + α213y2t-3 + u1t
y2t = α20 + α121y1t-1 + α221y2t-1 + α122y1t-2 + α222y2t-2 + α123y1t-3 + α223y2t-3 + u2t

where α0 = (α10, α20)′, yt = (y1t, y2t)′, ut = (u1t, u2t)′, and the coefficients on the lags of yt
are defined as follows: αijk refers to the kth lag of the ith variable in the jth
equation. This seems like a natural notation to use, although of course any
sensible alternative would also be correct.
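The longhand equations above can be simulated directly. This is a hedged sketch with invented coefficient values (chosen small enough that the VAR is stable), using the same a[equation][variable][lag] indexing idea:

```python
# Simulating the bivariate VAR(3) written out in longhand above
import random

random.seed(0)
a10, a20 = 0.1, 0.2
# a[j][i] is the list of lag coefficients for variable i in equation j
a = {1: {1: [0.3, 0.1, 0.05], 2: [0.2, 0.05, 0.02]},
     2: {1: [0.1, 0.05, 0.01], 2: [0.4, 0.1, 0.05]}}

y1, y2 = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]   # three pre-sample starting values
for t in range(200):
    u1, u2 = random.gauss(0, 1), random.gauss(0, 1)
    # y[-1-k] picks out lag k+1 of each variable
    new1 = a10 + sum(a[1][1][k] * y1[-1 - k] + a[1][2][k] * y2[-1 - k]
                     for k in range(3)) + u1
    new2 = a20 + sum(a[2][1][k] * y1[-1 - k] + a[2][2][k] * y2[-1 - k]
                     for k in range(3)) + u2
    y1.append(new1)
    y2.append(new2)
```

Note that each equation can be simulated (and, with data, estimated by OLS) separately, since only lagged values appear on the RHS.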
(b) This is basically a 'what are the advantages of VARs compared with
structural models?' type question, to which a simple and effective response
would be to list and explain the points made in the book.
The most important point is that structural models require the researcher to
specify some variables as being exogenous (if all variables were endogenous,
then none of the equations would be identified, and therefore estimation of the
structural equations would be impossible). This can be viewed as a restriction
(a restriction that the exogenous variables do not have any simultaneous
equations feedback), often called an identifying restriction. Determining
the identifying restrictions is supposed to be based on economic or
financial theory, but Sims, who first proposed the VAR methodology, argued
that such restrictions were 'incredible'. He thought that they were too loosely
based on theory, and were often specified by researchers on the basis of giving
the restrictions that the models required to make the equations identified.
Under a VAR, all the variables have equations, and so in a sense, every variable
is endogenous, which takes the ability to cheat (either deliberately or
inadvertently) or to misspecify the model in this way, out of the hands of the
researcher.
Another possible reason why VARs are popular in the academic literature is
that standard form VARs can be estimated using OLS since all of the lags on
the RHS are counted as predetermined variables.
Further, a glance at the academic literature which has sought to compare the
forecasting accuracies of structural models with VARs, reveals that VARs seem
to be rather better at forecasting (perhaps because the identifying restrictions
are not valid). Thus, from a purely pragmatic point of view, researchers may
prefer VARs if the purpose of the modelling exercise is to produce precise
point forecasts.
(c) VARs have, of course, also been subject to criticisms. The most important
of these criticisms is that VARs are atheoretical. In other words, they use very
little information from economic or financial theory to guide the model
specification process. The result is that the models often have little or no
theoretical interpretation, so that they are of limited use for testing and
evaluating theories.
Second, VARs can often contain a lot of parameters. The resulting loss in
degrees of freedom if the VAR is unrestricted and contains a lot of lags, could
lead to a loss of efficiency and the inclusion of lots of irrelevant or marginally
relevant terms. Third, it is not clear how the VAR lag lengths should be
chosen. Different methods are available (see part (d) of this question), but
they could lead to widely differing answers.
Finally, the very tools that have been proposed to help to obtain useful
information from VARs, i.e. impulse responses and variance decompositions,
are themselves difficult to interpret! See Runkle (1987).
(d) The two methods that we have examined are model restrictions and
information criteria. Details on how these work are contained in Sections
6.12.4 and 6.12.5. But briefly, the model restrictions approach involves
starting with the larger of the two models and testing whether it can be
restricted down to the smaller one using the likelihood ratio test based on the
yt = α + βt + ut
where t = 1, 2, … is the trend and ut is a zero mean white noise disturbance
term. This is called deterministic nonstationarity because the source of the
nonstationarity is a deterministic straight line process.
A variable containing a stochastic trend will also not cross its mean value
frequently and will wander a long way from its mean value. A stochastically
nonstationary process could be a unit root or explosive autoregressive process
such as
yt = φyt-1 + ut
where φ ≥ 1.
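The two kinds of nonstationarity can be contrasted in a small simulation. The parameter values here are invented for illustration: one series is stationary fluctuation around a deterministic line, the other is a pure random walk whose shocks persist forever:

```python
# Deterministic trend y_t = alpha + beta*t + u_t  versus
# stochastic trend (random walk, phi = 1) y_t = y_{t-1} + u_t
import random

random.seed(7)
alpha, beta = 1.0, 0.5

# Deterministic nonstationarity: noise around the straight line alpha + beta*t
det = [alpha + beta * t + random.gauss(0, 1) for t in range(1, 101)]

# Stochastic nonstationarity: each shock is carried forward permanently
rw = [0.0]
for _ in range(100):
    rw.append(rw[-1] + random.gauss(0, 1))
```

Detrending the first series leaves something stationary; the random walk, by contrast, must be differenced, and it wanders from its starting value without any tendency to return.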
2. (a) The null hypothesis is of a unit root against a one sided stationary
alternative, i.e. we have
H0: yt ∼ I(1)
H1: yt ∼ I(0)
which is also equivalent to
H0: ψ = 0
H1: ψ < 0
(b) The test statistic is given by ψ̂/SE(ψ̂), which equals -0.02/0.31 = -0.06.
Since this is not more negative than the appropriate critical value, we do not
reject the null hypothesis.
(c) We therefore conclude that there is at least one unit root in the series
(there could be 1, 2, 3 or more). What we would do now is to regress Δ2yt on
Δyt-1 and test if there is a further unit root. The null and alternative hypotheses
would now be
H0: Δyt ∼ I(1) i.e. yt ∼ I(2)
H1: Δyt ∼ I(0) i.e. yt ∼ I(1)
If we rejected the null hypothesis, we would therefore conclude that the first
differences are stationary, and hence the original series was I(1). If we did not
reject at this stage, we would conclude that yt must be at least I(2), and we
would have to test again until we rejected.
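The arithmetic of the basic Dickey-Fuller regression Δyt = ψyt-1 + ut (no constant) can be sketched as follows; the short series below is invented purely to show how ψ̂ and its test statistic are formed:

```python
# Dickey-Fuller regression without a constant: Delta y_t = psi*y_{t-1} + u_t
import math

y = [1.0, 0.6, 0.2, 0.5, 0.1, 0.4]             # illustrative series
dy = [y[t] - y[t - 1] for t in range(1, len(y))]
y_lag = y[:-1]

sxy = sum(d * x for d, x in zip(dy, y_lag))
sxx = sum(x * x for x in y_lag)
psi_hat = sxy / sxx                             # OLS slope (no intercept)

resid = [d - psi_hat * x for d, x in zip(dy, y_lag)]
s2 = sum(e * e for e in resid) / (len(dy) - 1)  # residual variance
se = math.sqrt(s2 / sxx)                        # SE(psi_hat)
t_stat = psi_hat / se
# t_stat must be compared with Dickey-Fuller critical values,
# NOT with the t-tables (see part (d))
```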
(d) We cannot compare the test statistic with one from a t-distribution since
we have nonstationarity under the null hypothesis, and hence the test statistic
will no longer follow a t-distribution.
3. Using the same regression as above, but on a different set of data, the
researcher now obtains the estimate ψ̂ = -0.52 with standard error 0.16.
(a) The test statistic is calculated as above. The value of the test statistic
= -0.52/0.16 = -3.25. We therefore reject the null hypothesis since the test
statistic is smaller (more negative) than the critical value.
(b) We conclude that the series is stationary since we reject the unit root null
hypothesis. We need do no further tests since we have already rejected.
(c) The researcher is correct. One possible source of non-whiteness is when
the errors are autocorrelated. This will occur if there is autocorrelation in the
original dependent variable in the regression (yt). In practice, we can easily
get around this by augmenting the test with lags of the dependent variable to
soak up the autocorrelation. The appropriate number of lags can be
determined using the information criteria.
4. (a) If two or more series are cointegrated, in intuitive terms this implies that
they have a long run equilibrium relationship that they may deviate from in
the short run, but which will always be returned to in the long run. In the
context of spot and futures prices, the fact that these are essentially prices of
the same asset but with different delivery and payment dates, means that
financial theory would suggest that they should be cointegrated. If they were
not cointegrated, this would imply that the series did not contain a common
stochastic trend and that they could therefore wander apart without bound
even in the long run. If the spot and futures prices for a given asset did
separate from one another, market forces would work to bring them back to
follow their long run relationship given by the cost of carry formula.
The Engle-Granger approach to cointegration involves first ensuring that the
variables are individually unit root processes (note that the test is often
conducted on the logs of the spot and of the futures prices rather than on the
price series themselves). Then a regression of one of the series on the other
(i.e. regressing spot on futures prices, or futures on spot prices) would be
conducted and the residuals from that regression collected.
These residuals would then be subjected to a DickeyFuller or augmented
DickeyFuller test. If the null hypothesis of a unit root in the DF test
regression residuals is not rejected, it would be concluded that a stationary
combination of the nonstationary variables has not been found and thus that
there is no cointegration. On the other hand, if the null is rejected, it would be
concluded that a stationary combination of the nonstationary variables has
been found and thus that the variables are cointegrated.
Forming an error correction model (ECM) following the Engle-Granger
approach is a two-stage process. The first stage is (assuming that the original
series are nonstationary) to determine whether the variables are
cointegrated. If they are not, obviously there would be no sense in forming an
ECM, and the appropriate response would be to form a model in first
differences only. If the variables are cointegrated, the second stage of the
process involves forming the error correction model which, in the context of
spot and futures prices, could be of the form given in equation (7.57) on page
345.
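The two Engle-Granger steps can be sketched on simulated (log) spot and futures series. The data below are invented, built around a shared random-walk trend so that the two series are cointegrated by construction: step one runs the cointegrating regression, step two applies a DF-type regression to its residuals:

```python
# Engle-Granger sketch: (i) OLS of s_t on f_t, (ii) DF regression on residuals
import random

random.seed(1)
trend, s, f = 0.0, [], []
for _ in range(500):
    trend += random.gauss(0, 1)                 # common stochastic trend
    s.append(trend + random.gauss(0, 0.1))      # "spot": trend + stationary noise
    f.append(trend + random.gauss(0, 0.1))      # "futures": trend + stationary noise

# Step 1: cointegrating regression s_t = a + b*f_t + e_t
n = len(s)
fbar, sbar = sum(f) / n, sum(s) / n
b = (sum((fi - fbar) * (si - sbar) for fi, si in zip(f, s))
     / sum((fi - fbar) ** 2 for fi in f))
a = sbar - b * fbar
e = [si - a - b * fi for si, fi in zip(s, f)]   # residuals to test for a unit root

# Step 2: DF regression on the residuals, Delta e_t = psi*e_{t-1} + v_t
de, elag = [e[t] - e[t - 1] for t in range(1, n)], e[:-1]
psi = sum(d * x for d, x in zip(de, elag)) / sum(x * x for x in elag)
# a strongly negative psi (against DF critical values) points to cointegration
```

Because the residuals here are close to white noise, ψ̂ comes out strongly negative; with real data, the residual-based test would be compared against the appropriate Engle-Granger critical values.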
(b) There are many other examples that one could draw from financial or
economic theory of situations where cointegration would be expected to be
present and where its absence could imply a permanent disequilibrium. It is
usually the presence of market forces and investors continually looking for
arbitrage opportunities that would lead us to expect cointegration to exist.
Good illustrations include equity prices and dividends, or price levels in a set
of countries and the exchange rates between them. The latter is embodied in
the purchasing power parity (PPP) theory, which suggests that a
representative basket of goods and services should, when converted into a
common currency, cost the same wherever in the world it is purchased. In the
context of PPP, one may expect cointegration since again, its absence would
imply that relative prices and the exchange rate could wander apart without
bound in the long run. This would imply that the general price of goods and
services in one country could get permanently out of line with those, when
converted into a common currency, of other countries. This would not be
expected to happen since people would spot a profitable opportunity to buy
the goods in one country where they were cheaper and to sell them in the
country where they were more expensive until the prices were forced back into
line. There is some evidence against PPP, however, and one explanation is that
transactions costs including transportation costs, currency conversion costs,
differential tax rates and restrictions on imports, stop full adjustment from
taking place. Services are also much less portable than goods and everybody
knows that everything costs twice as much in the UK as anywhere else in the
world.
5. (a) The Johansen test is computed in the following way. Suppose we have p
variables that we think might be cointegrated. First, ensure that all the
variables are of the same order of nonstationarity, and in fact are I(1), since it is
very unlikely that variables will be of a higher order of integration. Stack the
variables that are to be tested for cointegration into a p-dimensional vector,
called, say, yt. Then construct a p × 1 vector of first differences, Δyt, and form
and estimate the following VAR
Δyt = Πyt-k + Γ1Δyt-1 + Γ2Δyt-2 + ... + Γk-1Δyt-(k-1) + ut
Then test the rank of the matrix Π. If Π is of zero rank (i.e. all the eigenvalues
are not significantly different from zero), there is no cointegration; otherwise,
the rank will give the number of cointegrating vectors. (You could also go into
a bit more detail on how the eigenvalues are used to obtain the rank.)
(b) Repeating the table given in the question, but adding the null and
alternative hypotheses in each case, and letting r denote the number of
cointegrating vectors:

Null hypothesis   Alternative hypothesis   λmax     95% critical value
r = 0             r = 1                    38.962   33.178
r = 1             r = 2                    29.148   27.169
r = 2             r = 3                    16.304   20.278
r = 3             r = 4                     8.861   14.036
r = 4             r = 5                     1.994    3.962
Considering each row in the table in turn, and looking at the first one first, the
test statistic is greater than the critical value, so we reject the null hypothesis
that there are no cointegrating vectors. The same is true of the second row
(that is, we reject the null hypothesis of one cointegrating vector in favour of
the alternative that there are two). Looking now at the third row, we cannot
reject (at the 5% level) the null hypothesis that there are two cointegrating
vectors, and this is our conclusion. There are two independent linear
combinations of the variables that will be stationary.
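The sequential decision rule just applied can be sketched directly from the table:

```python
# lambda-max testing sequence: start at r = 0 and stop at the
# first null hypothesis that cannot be rejected
stats   = [38.962, 29.148, 16.304, 8.861, 1.994]   # test statistics
crit_95 = [33.178, 27.169, 20.278, 14.036, 3.962]  # 95% critical values

r = 0
for stat, crit in zip(stats, crit_95):
    if stat > crit:
        r += 1      # reject H0 of r cointegrating vectors; move to the next null
    else:
        break       # fail to reject: conclude there are r cointegrating vectors
print(r)
```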
(c) Johansen's method allows the testing of hypotheses by considering them
effectively as restrictions on the cointegrating vector. The first thing to note is
that all linear combinations of the cointegrating vectors are also cointegrating
vectors. Therefore, if there are many cointegrating vectors in the unrestricted
case and if the restrictions are relatively simple, it may be possible to satisfy
the restrictions without causing the eigenvalues of the estimated coefficient
matrix to change at all. However, as the restrictions become more complex,
renormalisation will no longer be sufficient to satisfy them, so that imposing
them will cause the eigenvalues of the restricted coefficient matrix to be
different to those of the unrestricted coefficient matrix. If the restriction(s)
implied by the hypothesis is (are) nearly already present in the data, then the
eigenvectors will not change significantly when the restriction is imposed. If,
on the other hand, the restriction on the data is severe, then the eigenvalues
will change significantly compared with the case when no restrictions were
imposed.
The test statistic for testing the validity of these restrictions is given by

-T Σ(i=r+1 to p) [ln(1 - λi) - ln(1 - λi*)]

where the λi* are the eigenvalues of the restricted model and the λi are those
of the unrestricted model.
If the restrictions are supported by the data, the eigenvalues will not change
much when the restrictions are imposed and so the test statistic will be small.
(d) There are many applications that could be considered, and tests for PPP,
for cointegration between international bond markets, and tests of the
expectations hypothesis were presented in Sections 7.9, 7.10, and 7.11
respectively. These are not repeated here.
(e) Both Johansen statistics can be thought of as being based on an
examination of the eigenvalues of the long run coefficient or Π matrix. In both
cases, the g eigenvalues (for a system containing g variables) are placed in
descending order: λ1 ≥ λ2 ≥ ... ≥ λg. The maximal eigenvalue (i.e. the λmax)
statistic is based on an examination of each eigenvalue separately, while the
trace statistic is based on a joint examination of the g - r smallest eigenvalues.
If the test statistic is greater than the critical value from Johansen's tables,
reject the null hypothesis that there are r cointegrating vectors in favour of the
alternative that there are r + 1 (for λmax) or more than r (for trace). The testing
is conducted in a sequence and under the null, r = 0, 1, ..., g - 1, so that the
hypotheses for trace and λmax are as follows
Null hypothesis   Trace alternative   λmax alternative
r = 0             H1: 0 < r ≤ g       H1: r = 1
r = 1             H1: 1 < r ≤ g       H1: r = 2
r = 2             H1: 2 < r ≤ g       H1: r = 3
...               ...                 ...
r = g - 1         H1: r = g           H1: r = g
Thus the trace test starts by examining all eigenvalues together to test H0: r =
0, and if this is not rejected, this is the end and the conclusion would be that
there is no cointegration. If this hypothesis is rejected, the largest
eigenvalue would be dropped and a joint test conducted using all of the
eigenvalues except the largest to test H0: r = 1. If this hypothesis is not
rejected, the conclusion would be that there is one cointegrating vector, while
if this is rejected, the second largest eigenvalue would be dropped and the test
statistic recomputed using the remaining g - 2 eigenvalues and so on. The
testing sequence would stop when the null hypothesis is not rejected.
The maximal eigenvalue test follows exactly the same testing sequence with
the same null hypothesis as for the trace test, but the λmax test only considers
one eigenvalue at a time. The null hypothesis that r = 0 is tested using the
largest eigenvalue. If this null is rejected, the null that r = 1 is examined using
the second largest eigenvalue and so on.
6. (a) The operation of the Johansen test has been described in the book, and
also in question 5, part (a) above. If the rank of the Π matrix is zero, this
implies that there is no cointegration or no common stochastic trends between
the series. A finding that the rank of Π is one or two would imply that there
were one or two linearly independent cointegrating vectors or combinations of
the series that would be stationary, respectively. A finding that the rank of Π is
3 would imply that the matrix is of full rank. Since the maximum number of
cointegrating vectors is g - 1, where g is the number of variables in the system,
this does not imply that there are 3 cointegrating vectors. In fact, the implication
of a rank of 3 would be that the original series were stationary, and provided
that unit root tests had been conducted on each series, this would have
effectively been ruled out.
(b) The first test of H0: r = 0 is conducted using the first row of the table.
Clearly, the test statistic is greater than the critical value so the null hypothesis
is rejected. Considering the second row, the same is true, so that the null of r =
1 is also rejected. Considering now H0: r = 2, the test statistic is smaller than
the critical value so that the null is not rejected. So we conclude that there are
2 cointegrating vectors, or in other words 2 linearly independent combinations
of the nonstationary variables that are stationary.
7. The fundamental difference between the Engle-Granger and the Johansen
approaches is that the former is a single-equation methodology whereas
Johansen is a systems technique involving the estimation of more than one
equation. The two approaches have been described in detail in Chapter 7 and
in the answers to the questions above, and will therefore not be covered again.
The main (arguably only) advantage of the Engle-Granger approach is its
simplicity and its intuitive interpretability. However, it has a number of
disadvantages that have been described in detail in Chapter 7, including its
inability to detect more than one cointegrating relationship and the
impossibility of validly testing hypotheses about the cointegrating vector.
where It-1 = 1 if ut-1 < 0, and It-1 = 0 otherwise.
The EGARCH model also has the added benefit that the model is expressed in
terms of the log of ht, so that even if the parameters are negative, the
conditional variance will always be positive. We do not therefore have to
artificially impose nonnegativity constraints.
One form of the GARCH-M model can be written
yt = μ + δht-1 + ut
so that the model allows the lagged value of the conditional variance to affect
the return. In other words, our best current estimate of the total risk of the
the return. In other words, our best current estimate of the total risk of the
(e). Since yt are returns, we would expect their mean value (which will be
but suppose that we had a year of daily r
average daily percentage return over the year, which might be, say 0.05
something of that order. The unconditional variance of the disturbances would
respectively. The important thing is that all three alphas must be positive, and
(f) Since the model was estimated using maximum likelihood, it does not seem
natural to test this restriction using the F-test via comparisons of residual
sums of squares (and a t-test cannot be used since it is a test involving more
than one coefficient). Thus we should use one of the approaches to hypothesis
testing based on the principles of maximum likelihood (Wald, Lagrange
Multiplier, Likelihood Ratio). The easiest one to use would be the likelihood
ratio test, which would be computed as follows:
1.
Estimate the unrestricted model and obtain the maximised value of the
log-likelihood function.
2.
Impose the restriction by rearranging the model, and estimate the
restricted model, again obtaining the value of the likelihood at the new
optimum. Note that this value of the LLF will be likely to be lower than the
unconstrained maximum.
3.
Form the likelihood ratio test statistic, LR = −2(Lr − Lu), which is
asymptotically distributed as a χ²(m),
where Lr and Lu are the values of the LLF for the restricted and unrestricted
models respectively, and m denotes the number of restrictions, which in this
case is one.
4.
If the value of the test statistic is greater than the critical value, reject
the null hypothesis that the restrictions are valid.
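The four steps above can be sketched numerically; the two log-likelihood values below are hypothetical, chosen purely to illustrate the mechanics.

```python
# Likelihood ratio test sketch. The maximised log-likelihood values are
# hypothetical, not estimates from the text.
L_u = -1312.5   # step 1: maximised LLF of the unrestricted model
L_r = -1315.7   # step 2: maximised LLF of the restricted model
m = 1           # number of restrictions

# Step 3: LR statistic, asymptotically chi-squared with m degrees of freedom
lr_stat = -2 * (L_r - L_u)

# Step 4: compare with the 5% critical value of a chi-squared(1), 3.841
critical_value = 3.841
reject_null = lr_stat > critical_value
print(round(lr_stat, 2), reject_null)
```

Note that the restricted LLF can never exceed the unrestricted one, so the statistic is always non-negative.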
(g) In fact, it is possible to produce volatility (conditional variance) forecasts in
exactly the same way as forecasts are generated from an ARMA model by
iterating through the equations with the conditional expectations operator.
We know all information up to and including that available at time T. The answer to
this question will use the convention from the GARCH modelling literature to
denote the conditional variance by ht rather than σt². ΩT
denotes all information available up to and including observation T. Adding 1,
then 2, then 3 to each of the time subscripts, we have the conditional variance
equations for times T+1, T+2, and T+3:
(1) hT+1 = α0 + α1(uT)² + βhT
(2) hT+2 = α0 + α1(uT+1)² + βhT+1
(3) hT+3 = α0 + α1(uT+2)² + βhT+2
Let hf1,T be the one-step-ahead forecast for h made at time T. This is easy to
calculate since, at time T, we know the values of all the terms on the RHS.
Given hf1,T, how do we calculate hf2,T, that is, the two-step-ahead forecast for h made at
time T?
From (2), we can write
(4) hf2,T = α0 + α1ET[(uT+1)²] + βhf1,T
where ET[(uT+1)²] is the expectation, made at time T, of (uT+1)², which is the squared
disturbance term at time T+1. To evaluate this expectation, note that since E(ut) = 0, we
can now write
Var(ut) = E[(ut − E(ut))²] = E[(ut)²].
The conditional variance of ut is ht, so ht = ET−1[(ut)²].
Turning this argument around, and applying it to the problem that we have,
ET[(uT+1)²] = hT+1
but we do not know hT+1, so we replace it with hf1,T, so that (4) becomes
hf2,T = α0 + α1hf1,T + βhf1,T = α0 + (α1 + β)hf1,T
And so on. This is the method we could use to forecast the conditional variance
of yt. If yt were, say, daily returns on the FTSE, we could use these volatility
forecasts as an input in the Black-Scholes equation to help determine the
appropriate price of FTSE index options.
(h) An s-step-ahead forecast for the conditional variance could be written
hfs,T = α0[1 + (α1 + β) + (α1 + β)² + ... + (α1 + β)^(s−2)] + (α1 + β)^(s−1)hf1,T
If (α1 + β) ≥ 1, hf1,T is multiplied by a term raised to ever higher powers,
which is bigger than 1. It is obvious that this will cause the
forecasts to explode. The forecasts will keep on increasing and will tend to
infinity as the forecast horizon increases (i.e. as s increases). This is obviously
an undesirable property of a forecasting model! This is called non-stationarity in variance.
In the case where α1 + β = 1, known as integrated
GARCH or IGARCH, there is a unit root in the conditional variance, and the
forecasts will stay constant as the forecast horizon increases.
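The iterative forecasting scheme of part (g) can be sketched in a few lines of code. The parameter values and time-T inputs below are hypothetical placeholders, not estimates from the question.

```python
# Iterating GARCH(1,1) conditional variance forecasts, as in part (g).
# Parameter values and time-T inputs are hypothetical.
alpha0, alpha1, beta = 0.05, 0.10, 0.85   # alpha1 + beta < 1: stationary in variance
u_T2, h_T = 1.2, 0.9                      # squared disturbance and variance at time T

# One-step-ahead forecast: every term on the RHS is known at time T
h_f = alpha0 + alpha1 * u_T2 + beta * h_T

# s-step-ahead forecasts: replace E_T[u^2] with the previous variance forecast
forecasts = [h_f]
for s in range(2, 21):
    h_f = alpha0 + (alpha1 + beta) * h_f
    forecasts.append(h_f)

# With alpha1 + beta < 1, the forecasts converge towards the unconditional
# variance alpha0 / (1 - alpha1 - beta) = 1.0 as s grows
print(round(forecasts[0], 3), round(forecasts[-1], 3))
```

Setting alpha1 + beta equal to or above one in this loop reproduces the explosive (or constant, for IGARCH with alpha0 = 0) forecast behaviour described in part (h).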
2. (a) Maximum likelihood works by finding the most likely values of the
parameters given the actual data. More specifically, a loglikelihood function is
formed, usually based upon a normality assumption for the disturbance terms,
and the values of the parameters that maximise it are sought. Maximum
likelihood estimation can be employed to find parameter values for both linear
and nonlinear models.
(b) The three hypothesis testing procedures available within the maximum
likelihood approach are Lagrange multiplier (LM), likelihood ratio (LR) and
Wald tests. The differences between them are described in Figure 8.4, and are
not defined again here. The Lagrange multiplier test involves estimation only
under the null hypothesis, the likelihood ratio test involves estimation under
both the null and the alternative hypothesis, while the Wald test involves
estimation only under the alternative. Given this, it should be evident that the
LM test will in many cases be the simplest to compute since the restrictions
implied by the null hypothesis will usually lead to some terms cancelling out to
give a simplified model relative to the unrestricted model.
(c) OLS will give identical parameter estimates for all of the intercept and
slope parameters, but will give a slightly different parameter estimate for the
variance of the disturbances. These are shown in the Appendix to Chapter 8.
The difference in the OLS and maximum likelihood estimators for the variance
of the disturbances can be seen by comparing the divisors of equations
(8A.25) and (8A.26).
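The difference in divisors can be illustrated numerically; the residuals below are hypothetical, and k stands for the number of mean-equation parameters.

```python
# OLS and ML estimators of the disturbance variance differ only in the
# divisor: OLS divides the residual sum of squares by T - k, ML by T.
# The residuals are hypothetical.
residuals = [0.3, -0.5, 0.1, 0.7, -0.2, -0.4, 0.6, -0.6]
T = len(residuals)
k = 2                                # e.g. an intercept and one slope

rss = sum(u**2 for u in residuals)   # residual sum of squares
var_ols = rss / (T - k)              # unbiased estimator
var_ml = rss / T                     # biased but consistent estimator

print(var_ml < var_ols)              # the ML estimate is always the smaller
```

As T grows with k fixed, the two estimates converge, which is why the distinction matters little in large samples.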
3. (a) The unconditional variance of a random variable could be thought of,
abusing the terminology somewhat, as the variance without reference to a
time index, or rather the variance of the data taken as a whole, without
conditioning on a particular information set. The conditional variance, on the
other hand, is the variance of a random variable at a particular point in time,
conditional upon a particular information set. The variance of ut conditional
on its own previous values would be Var(ut | ut−1, ut−2, ...) = E[(ut − E(ut))² | ut−1, ut−2, ...], while the unconditional variance would simply be
Var(ut) = E[(ut − E(ut))²].
Forecasts from models such as GARCH would be conditional forecasts,
produced for a particular point in time, while historical volatility is an
unconditional measure that would generate unconditional forecasts. For
producing one-step-ahead forecasts, it is likely that a conditional model making
use of recent relevant information will provide more accurate forecasts
(although whether it would in any particular application is an empirical
question). As the forecast horizon increases, however, a GARCH model that is
stationary in variance will yield forecasts that converge upon the long-term
average (historical) volatility. By the time we reach 20 steps ahead, the
forecasts from a GARCH model will typically be very close to the simple historical or
unweighted averages described above. However, GARCH models are far more
difficult to estimate than the other two models, and sometimes, when
estimation goes wrong, the resulting parameter estimates can be nonsensical,
leading to nonsensical forecasts as well. Thus it is important to apply a reality
check to estimated GARCH models to ensure that the coefficient estimates
are intuitively plausible. Finally, implied volatility estimates are those derived
from the prices of traded options. The market-implied volatility forecasts are
obtained by backing out the volatility from the price of an option using an
option pricing formula together with an iterative search procedure. Financial
market practitioners would probably argue that implied forecasts of the future
volatility of the underlying asset are likely to be more accurate than those
estimated from statistical models because the people who work in financial
markets know more about what is likely to happen to those instruments in the
future than econometricians do. Also, an inaccurate volatility forecast
implied from an option price may imply an inaccurate option price and
therefore the possibility of arbitrage opportunities. However, the empirical
evidence on the accuracy of implied versus statistical forecasting models is
mixed, and some research suggests that implied volatility systematically overestimates the true volatility of the underlying asset returns. This may arise
from the use of an incorrect option pricing formula to obtain the implied
volatility; for example, the Black-Scholes model assumes that the volatility of
the underlying asset is fixed (nonstochastic), and also that the returns to the
underlying asset are normally distributed. Both of these assumptions are at
best tenuous. A further reason for the apparent failure of the implied model
may be a manifestation of the peso problem. This occurs when market
practitioners include in the information set that they use to price options the
possibility of a very extreme return that has a low probability of occurrence,
but has important ramifications for the price of the option due to its sheer
size. If this event does not occur in the sample period over which the implied
and actual volatilities are compared, the implied model will appear inaccurate.
Yet this does not mean that the practitioners' forecasts were wrong, but rather
simply that the low-probability, high-impact event did not happen during that
sample period. It is also worth stating that only one implied volatility can be
calculated from each option price for the average volatility of the underlying
asset over the remaining lifetime of the option.
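The "iterative search procedure" for backing a volatility out of an option price can be sketched as a simple bisection on the Black-Scholes call formula. All input values below are hypothetical.

```python
# Backing an implied volatility out of a call price by iterative search.
# Black-Scholes call formula; all inputs are hypothetical.
from math import log, sqrt, exp, erf

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S, K, T, r, sigma):
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

def implied_vol(price, S, K, T, r, lo=1e-6, hi=5.0, tol=1e-8):
    # Bisection: the call price is increasing in sigma, so halve the bracket
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call(S, K, T, r, mid) > price:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Price an option with a known volatility, then recover that volatility
true_sigma = 0.25
price = bs_call(100.0, 105.0, 0.5, 0.03, true_sigma)
print(round(implied_vol(price, 100.0, 105.0, 0.5, 0.03), 4))  # 0.25
```

In practice a faster Newton-type search on the option's vega would normally be used, but the bisection makes the logic of the search transparent.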
4. (a) The coefficients in the conditional mean equations would be expected
to be very small since the data are daily returns. They could
be positive or negative, although a positive average return is probably more
likely. Similarly, the intercept terms in the conditional variance equations
would also be expected to be small and positive since this is daily data. The
coefficients on the lagged squared error and lagged conditional variance in the
conditional variance equations must lie between zero and one, and their sum
should be less than one for the variance processes to be stationary. The
values for the conditional covariance equation are more difficult to predict,
since there is less intuition available about the likely behaviour of
covariances. The parameters in this equation could be negative, although
given that the returns for two stock markets are likely to be positively
correlated, the parameters would probably be positive, although the model
would still be a valid one if they were not.
(b) One of two procedures could be used. Either the daily returns data would
be transformed into weekly returns data by adding up the returns over all of
the trading days in each week, or the model would be estimated using the daily
data. Daily forecasts would then be produced up to 10 days (2 trading weeks)
ahead.
In both cases, the models would be estimated, and forecasts made of the
conditional variance and conditional covariance. If daily data were used to
estimate the model, the forecasts for the conditional covariance forecasts for
the 5 trading days in a week would be added together to form a covariance
forecast for that week, and similarly for the variance. If the returns had been
aggregated to the weekly frequency, the forecasts used would simply be one-step
ahead.
Finally, the conditional covariance forecast for the week would be divided by
the product of the square roots of the conditional variance forecasts to obtain a
correlation forecast.
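The aggregation and correlation calculation just described can be sketched as follows; the five daily forecast figures are hypothetical.

```python
# Form a weekly correlation forecast from 5 daily conditional variance and
# covariance forecasts, as described above. All numbers are hypothetical.
from math import sqrt

h11 = [1.10, 1.08, 1.07, 1.06, 1.05]   # daily variance forecasts, market 1
h22 = [0.90, 0.91, 0.92, 0.92, 0.93]   # daily variance forecasts, market 2
h12 = [0.40, 0.41, 0.41, 0.42, 0.42]   # daily covariance forecasts

# Add the daily forecasts over the 5 trading days to get weekly figures
week_var1 = sum(h11)
week_var2 = sum(h22)
week_cov = sum(h12)

# Divide the covariance forecast by the product of the square roots of the
# variance forecasts to obtain the correlation forecast
corr = week_cov / (sqrt(week_var1) * sqrt(week_var2))
print(round(corr, 3))
```

Note that adding daily returns (and hence daily variances and covariances) assumes returns are uncorrelated across days within the week.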
(c) There are various approaches available, including computing simple
historical correlations, exponentially weighted measures, and implied
correlations derived from the prices of traded options.
(d) The simple historical approach is obviously the simplest to calculate, but
has two main drawbacks. First, it does not weight information: so any
observations within the sample will be given equal weight, while those outside
the sample will automatically be given a weight of zero. Second, any extreme
observations in the sample will have an equal effect until they abruptly drop
out of the measurement period. For example, suppose that one year of daily
data is used to estimate volatility. If the sample is rolled through one day at a
time, an observation corresponding to a market crash will appear in the next
250 samples, with equal effect, but will then disappear altogether.
Exponentially weighted moving average models of covariance and variance
(which can be used to construct correlation measures) more plausibly give
additional weight to more recent observations, with the weight given to each
observation declining exponentially as they go further back into the past.
These models have the undesirable property that the forecasts for different
numbers of steps ahead will be the same. Hence the forecasts will not tend to
the unconditional mean as those from a suitable GARCH model would.
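An exponentially weighted moving average variance of the kind described here can be sketched in a few lines; the decay factor of 0.94 is the common RiskMetrics choice, used purely as an illustration, and the returns are hypothetical.

```python
# EWMA variance: the weight on each observation declines exponentially as
# it recedes into the past. Returns below are hypothetical daily returns.
lam = 0.94                      # decay factor; larger = slower decay
returns = [0.5, -1.2, 0.3, 2.1, -0.7, 0.4, -0.2, 1.0]

ewma_var = returns[0] ** 2      # initialise with the first squared return
for r in returns[1:]:
    # today's variance = lam * yesterday's variance + (1 - lam) * squared return
    ewma_var = lam * ewma_var + (1.0 - lam) * r ** 2

# Because lam + (1 - lam) = 1, the s-step-ahead forecast is flat: it equals
# ewma_var at every horizon s, unlike a stationary GARCH forecast
print(round(ewma_var, 4))
```

The flat forecast profile in the final comment is exactly the "undesirable property" noted in the text: the EWMA forecasts never revert to the unconditional mean.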
Finally, implied correlations may at first blush appear to be the best method
for calculating correlation forecasts accurately, for they rely on information
obtained from the market itself. After all, who should know better about future
correlations in the markets than the people who work in those markets?
However, marketbased measures of volatility and correlation are sometimes
surprisingly inaccurate, and are also sometimes difficult to obtain. Most
fundamentally, correlation forecasts will only be available where there is an
option traded whose payoffs depend on the prices of two underlying assets.
For all other situations, a marketbased correlation forecast will simply not be
available.
Finally, multivariate GARCH models will give more weight to recent
observations in computing the forecasts, but may be difficult and
computationally intensive to estimate.
5. (a) A news impact curve shows the effect of shocks of different magnitudes on
the next period's volatility. These curves can be used to examine visually
whether there are any asymmetry effects in volatility for a particular set of
data. For the data given in this question, the way I would approach it is to put
values of the lagged error into column A, ranging from −1 to +1 in increments
of 0.01, then simply enter the formulae for the GARCH and EGARCH models
into columns B and C so that they refer to those values of the lagged error in
column A, and plot the resulting conditional variances against the lagged
error to obtain the graph.
This graph is a bit of an odd one, in the sense that the conditional variance is
always lower for the EGARCH model. This may suggest estimation error in
one of the models. There is some evidence of asymmetry in the case of the
EGARCH model, since the value of the conditional variance is 0.1 for a shock of
one sign and 0.12 for a shock of the same size but of the opposite sign.
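The spreadsheet construction described above can equally be sketched in code. The parameter values are hypothetical placeholders, and a GJR-style indicator term stands in for the EGARCH asymmetry to keep the sketch short; the comparison of symmetric versus asymmetric responses is the same.

```python
# News impact curves: next-period conditional variance as a function of the
# lagged shock, holding the lagged variance at a fixed benchmark level.
# All parameter values are hypothetical.
alpha0, alpha1, beta = 0.01, 0.08, 0.90
gamma = 0.05                                 # asymmetry term (GJR-style)
h_bar = alpha0 / (1 - alpha1 - beta)         # benchmark: unconditional variance

shocks = [x / 100.0 for x in range(-100, 101)]   # lagged errors from -1 to +1

# Symmetric GARCH curve: depends only on the squared shock
nic_garch = [alpha0 + alpha1 * u**2 + beta * h_bar for u in shocks]

# Asymmetric curve: the gamma term switches on only for negative shocks
nic_asym = [alpha0 + alpha1 * u**2 + gamma * u**2 * (u < 0) + beta * h_bar
            for u in shocks]

# The symmetric curve treats shocks of -1 and +1 identically; the asymmetric
# curve responds more strongly to the negative shock
print(nic_garch[0] == nic_garch[-1], nic_asym[0] > nic_asym[-1])
```

Plotting both lists against the shocks reproduces the V-shaped symmetric curve and the tilted asymmetric curve that the question asks for.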
(b) This is a tricky one. The leverage effect is used to rationalise a finding of
asymmetries in equity returns, but such an argument cannot be applied to
foreign exchange returns, since the concept of a Debt/Equity ratio has no
meaning in that context.
On the other hand, there is equally no reason to suppose that there are no
asymmetries in the case of FX data. The data used here were daily USD-GBP
returns for 1974-1994. It might be the case, for example, that news
relating to one country has a differential impact relative to equally good or bad news
relating to another. To offer one illustration, it might be the case that the bad
news for the currently weak euro has a bigger impact on volatility than news
about the currently strong dollar. This would lead to asymmetries in the news
impact curve. Finally, it is also worth noting that the asymmetry term in the